Abstract


Representations of sequential data are commonly based on the assumption 
that observed sequences are realizations of an unknown underlying stochas- 
tic process. Usually, the determination of such a representation is construed 
as a learning problem and yields a sequence model. In this context, the model 
must be able to capture the multi-modal nature of the data, without blurring 
between single modes. For modeling the underlying stochastic process, com- 
monly used neural network-based approaches either learn an implicit repre- 
sentation by using stochastic inputs or units, or learn to parameterize a proba- 
bility distribution. As such, these models usually incorporate Monte Carlo or 
other approximation techniques in order to perform parameter estimation and 
probabilistic inference. This even holds true for regression-based approaches 
based on Mixture Density Networks, which still require Monte Carlo simu- 
lation for performing multi-modal inference. Thus, a research gap in fully 
regression-based approaches for parameter estimation and probabilistic in- 
ference emerges. 


To this end, this thesis proposes a probabilistic extension to Bézier
curves (N-Curves) as a basis for effectively modeling continuous-time
stochastic processes with a bounded index set. The proposed stochastic
process model is denoted as the N-Curve model and is based on Mixture Density
Networks (MDN) and Bézier curves with Gaussian random variables as control
points. Taking an MDN-based approach is in line with recent attempts
to address the problem of quantifying uncertainty as a regression problem
and yields a generic model, which is generally applicable as a base model
for probabilistic sequence modeling. Key advantages of the model include
the ability to generate smooth multi-modal predictions in a single inference
step, which avoids the need for Monte Carlo simulation. Further, being based
on Bézier curves, the model can, in theory, be scaled up to high-dimensional
sequence data by embedding the control points in a high-dimensional space.
To address theoretical limitations imposed by the restriction to a
bounded index set, a conceptual extension to the N-Curve model, capable
of modeling infinite stochastic processes, is presented. Essential properties
of the proposed approach and its extension are illustrated by several toy
examples considering a sequence synthesis task.


With the original N-Curve model being sufficient for most real-world
applications, a thorough evaluation is conducted on different multi-step
sequence prediction tasks to assess the capabilities of the model on
real-world data. First, the model is evaluated against commonly used generic
probabilistic sequence models on a human trajectory prediction task, where it
outperforms the other models in the comparison. A qualitative evaluation
investigates the behavior of the model in a prediction context. Further,
difficulties in assessing the performance of probabilistic sequence models in
a multi-modal setting are discussed. In addition, the model is applied to a
human motion prediction task, assessing the claimed scalability of the model
to higher-dimensional data. In this task, the model outperforms commonly used
simple and neural network-based baselines and performs on par with different
state-of-the-art models on several occasions, demonstrating its capabilities
in this higher-dimensional example. Further, difficulties in covariance
estimation and the smoothing property of the N-Curve model are discussed.
Kurzfassung


Representations of sequential data are generally based on the assumption
that observed sequences are realizations of an unknown underlying
stochastic process. Determining such a representation is usually construed
as a learning problem and yields a sequence model. In this context, the model
must be able to capture the multi-modal nature of the data without blending
individual modes. To model an underlying stochastic process, commonly used
neural network-based approaches learn either to parameterize a probability
distribution or an implicit representation using stochastic inputs or
neurons. These models generally incorporate Monte Carlo methods or other
approximate solutions to enable parameter estimation and probabilistic
inference. This even holds for regression-based approaches based on Mixture
Density Networks, which likewise require Monte Carlo simulations for
multi-modal inference. This results in a research gap for fully
regression-based approaches to parameter estimation and probabilistic
inference.


Consequently, this thesis presents a probabilistic extension to Bézier
curves (N-Curves) as a basis for modeling continuous-time stochastic
processes with a bounded index set. The presented model, denoted as the
N-Curve model, is based on Mixture Density Networks (MDN) and Bézier curves
whose control points are assumed to be normally distributed. Using an
MDN-based approach is in line with current attempts to construe uncertainty
estimation as a regression problem, and yields a generic model that is
generally applicable as a base model for probabilistic sequence modeling.
One key advantage of the model is the ability to generate smooth,
multi-modal predictions in a single inference step without requiring Monte
Carlo simulations. Furthermore, by building on Bézier curves, the model can
in theory be used for arbitrarily high data dimensions by embedding the
control points in a high-dimensional space. To lift the theoretical
limitations arising from the focus on bounded index sets, a conceptual
extension to the N-Curve model is additionally presented, with which
infinite stochastic processes can be modeled. Essential properties of the
presented model and its extension are shown on various sequence synthesis
examples.


Since the N-Curve model is sufficiently applicable to most use cases, its
suitability is evaluated extensively on different multi-step prediction
tasks using real-world data. First, the model is evaluated against commonly
used probabilistic sequence models in the context of pedestrian trajectory
prediction, where it outperforms all models in the comparison. A qualitative
evaluation examines the behavior of the model in a prediction context. In
addition, difficulties in assessing probabilistic sequence models in a
multi-modal setting are discussed. Furthermore, the model is applied in the
context of human motion prediction in order to assess the intended
scalability of the model to higher-dimensional data. In this task, the model
outperforms commonly used simple and neural network-based baseline models
and is on par with various state-of-the-art models in several situations,
which shows its applicability in this higher-dimensional example. Finally,
difficulties in covariance estimation and the smoothing properties of the
N-Curve model are discussed.
Notation


This chapter introduces the notation and symbols which are used in this thesis. 


General notation 


Scalars            italic Roman and Greek lowercase letters       $a, \alpha$
Sets               bold calligraphic Roman uppercase letters      $\boldsymbol{\mathcal{D}}$
Sequences          calligraphic Roman uppercase letters           $\mathcal{S}$
Vectors            bold Roman lowercase letters                   $\mathbf{x}$
Matrices           bold Roman uppercase letters                   $\mathbf{R}$
Random variables   italic Roman uppercase letters                 $X$


Probability Distributions 


$\mathcal{N}$                          Gaussian normal distribution
                                       Gaussian mixture distribution
$p(\cdot)$, $q(\cdot)$                 probability density functions
$p_\theta(\cdot)$, $q_\phi(\cdot)$     probability density functions parameterized by $\theta$, $\phi$


Numbers, Indexing and Conventions 


$\mathbb{N}$          natural numbers
$\mathbb{N}_0$        natural numbers including zero (non-negative integers)
$\mathbb{R}$          real numbers
$\mathbb{R}_0^+$      non-negative real numbers (including zero)
                      indexing for points in time
                      indexing for objects, measurements and points
                      quantities
                      zero vector
                      identity matrix


Parametric Curves and Sequence Modeling


$\mathcal{G}_T$       stochastic process with index set $T$
$T$                   index sets
                      set of (Gaussian) curve control points
                      (probabilistic) parametric curve function
$b_{i,n}(t)$          Bernstein polynomials
                      N-Curve (mixture)
                      N-Curve mean function
                      N-Curve covariance function
                      mixture weights
                      curve point probability density function
                      meta-time index
                      curve-time index
                      meta-time mapping
                      curve-time mapping
                      (training/test) dataset
                      sequence encoding
                      model parameters


1 Introduction 


Sequential data, i.e. temporally ordered information, arises in many
different applications, such as risk assessment in autonomous driving or
data-driven behavior analysis. In general, the majority of such use cases
can be reduced to more abstract inference tasks, such as sequence
prediction. With real-world data being subject to noise and detection or
annotation errors, the use of a probabilistic sequence model is favorable, as
such models also take uncertainty in the data into account.


(a) Trajectory prediction (b) Out-of-distribution detection 


Figure 1.1: Exemplary sequence modeling tasks on different levels of abstraction: 2D trajectory
prediction in a constrained setting and out-of-distribution detection built upon the
probabilistic sequence model derived when training the prediction model. Both tasks
can contribute to a superordinate risk assessment application. The prediction task 
(1.1a) is concerned with future trajectory prediction (red, green and blue distribu- 
tions) given an observed trajectory (solid cyan). In such a structured environment, a 
sequence model is learned, which is capable of capturing statistically relevant paths 
through the given scene. As the sequence model provides a model for the underlying 
data distribution, out-of-distribution detection can be performed given a trajectory 
(1.1b). In this example, moving on the pathway is valid under the model, but moving 
onto the grass is highly unlikely. The validity under the model is color-coded from 
red (not valid) to blue (valid). Figure 1.1b is taken from [Har17]. 


The determination of such a probabilistic sequence model is commonly framed
as a learning problem, where the model parameters are estimated from
given data samples. This formulation as a learning problem goes along with
the current dominance of deep learning approaches in a range of differ- 
ent fields related to sequential data. However, working with uncertainties 
and associated probability distributions, most current deep learning-based 
approaches for probabilistic sequence modeling rely on the calculation of 
intractable probability density functions. Because of that, variational or 
sample-based approximations are generally required during training and 
inference in such models. Although there exist regression-based approaches, 
which try to avoid the need for such expensive approximations during 
training, they still require Monte Carlo methods for inference. 


Following this, a common ground for current sequence modeling approaches 
can be observed in their need for Monte Carlo methods during inference. 
Thus, a research gap in regression-based approaches for multi-modal prob- 
abilistic inference emerges. 


To this end, the primary goal of this thesis is the formulation of a fully
regression-based probabilistic sequence model. In addition, the following
common drawbacks of existing models should be avoided:


1 The necessity of Monte Carlo or approximate Bayesian methods
during either training or inference.


2 The often uncontrolled nature of common sequential approaches to
sequence generation, which makes them prone to artifact generation.


Following this, this thesis proposes a probabilistic extension for parametric 
curves for use in probabilistic sequence modeling and provides an implemen- 
tation of the resulting model based on regression neural networks. The moti- 
vation for basing the approach on parametric curves is driven by the following 
expectations: First, modeling full curves enables instant multi-step inference 
without iteration and the need for Monte Carlo methods. Further, generated 
sequences are constrained by the underlying parametric curves. This, in turn,
is expected to help stabilize training. In addition, artifact generation
during inference should be mitigated. Finally, modeling a stochastic process in
terms of a probabilistic parametric curve yields a compact representation of
said stochastic process.


1.1 Contributions 


In compliance with the aforementioned primary objectives, the main contri- 
butions provided in this thesis revolve around a novel probabilistic sequence 
model, built on a probabilistic extension to parametric curves. As such, the 
contributions can be ascribed to three categories: theory, algorithms and eval- 
uation. 


Theory: A probabilistic extension to Bézier curves and Bézier splines capable
of modeling multi-modal stochastic processes is derived. In this extension, the
Bézier curve's control points are assumed to be Gaussian, so that the curve
points, being linear combinations of the control points, inherit their
stochasticity. This results in a model for a continuous-time stochastic process.
Discrete-time stochastic processes can be represented by discretizing such a
probabilistic curve. Multi-modality is achieved by combining multiple
probabilistic curves into a mixture.
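
As a minimal sketch of this linear-combination argument, the following Python/NumPy snippet computes the mean and covariance of a single curve point from Gaussian control points. It assumes mutually independent control points, which is a simplification made here for illustration and not necessarily the formulation derived later in the thesis; the function names `bernstein` and `curve_point_moments` are hypothetical.

```python
import numpy as np
from math import comb


def bernstein(n: int, t: float) -> np.ndarray:
    """Bernstein basis b_{i,n}(t) for i = 0..n, with t in [0, 1]."""
    return np.array(
        [comb(n, i) * t**i * (1.0 - t) ** (n - i) for i in range(n + 1)]
    )


def curve_point_moments(means, covs, t):
    """Mean and covariance of the curve point at curve parameter t.

    means: (n+1, d) control point means.
    covs:  (n+1, d, d) control point covariances.
    Assumption (illustrative): control points are mutually independent,
    so the covariance of the weighted sum is the sum of weighted covariances.
    """
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    b = bernstein(len(means) - 1, t)
    mean = b @ means                            # sum_i b_i(t) * mu_i
    cov = np.einsum("i,ijk->jk", b**2, covs)    # sum_i b_i(t)^2 * Sigma_i
    return mean, cov


# Quadratic curve in 2D with three Gaussian control points.
control_means = [[0.0, 0.0], [1.0, 2.0], [2.0, 0.0]]
control_covs = [0.1 * np.eye(2)] * 3
mid_mean, mid_cov = curve_point_moments(control_means, control_covs, 0.5)
```

Evaluating `curve_point_moments` on a grid of values of `t` yields a discretized Gaussian sequence, which corresponds to the discretization step mentioned above. A mixture would repeat this per component and attach a weight to each.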


Algorithms: A learning- and regression-based approach for applying these 
probabilistic parametric curves in different sequence modeling tasks, specifi- 
cally synthesis and prediction, is proposed. The approach is based on a Mix- 
ture Density Network, which outputs the parameters for (a mixture of) proba- 
bilistic parametric curves. This enables multi-step sequence generation with- 
out iteration or the need for Monte Carlo methods. Several toy examples 
assess different aspects and qualities of the approach. 


Evaluation: An extensive evaluation of the proposed model is provided for the 
task of human trajectory prediction on real-world datasets. In addition, the 
common approach to evaluation in human trajectory prediction is examined 
with an attempt to provide insight into the suitability of the methodology for 
different task setups. Emphasis is put especially on commonly used performance
measures. Finally, scalability of the approach is demonstrated in a
higher-dimensional scenario given by human motion prediction.


Additional contributions to the field of human trajectory prediction exceeding 
the topical scope of this thesis are given by: 


1 A learning-based normalization for sequential data, which can be used 
as a data preprocessing method [Hug20b]. 


2 An approach for estimating the complexity of a given benchmark 
dataset accompanied by a ranking of commonly used datasets 
[Hug21]. 


3 A complementary benchmark, aiming at a fine-grained evaluation of 
trajectory prediction models, using a hierarchy of tasks [Hug20a]. 


1.2 Outline 


The thesis is structured as follows: Chapter 2 provides a brief overview of the
most common probabilistic sequence models on which most state-of-the-art deep
learning models are built. This background chapter also serves to support the
research gap claimed above. Chapter 3 provides the derivation of a probabilistic
extension for Bézier curves and Bézier splines, including discussions on choices
made for the approach and comparisons with related probabilistic sequence
models. Closely connected to Chapter 3 is the proposed implementation of the
probabilistic curve model given in Chapter 4. Besides implementation details,
e.g. the structure of the model, several toy examples are provided, assessing
different aspects and qualities of the model. Chapter 5 provides a real-world
evaluation of the proposed model, using a low-dimensional and a
higher-dimensional task, given by human trajectory prediction and human motion
prediction, respectively. Finally, Chapters 6 and 7 conclude the thesis and
point to potential future research directions.


2 Sequence Modeling 


In the context of machine learning, the task of sequence modeling is, in general,
concerned with determining (stochastic) models able to represent, process and
generate sequential data from a given data basis. When uncertainties about
the data are taken into account, the sequence model aims to provide either an
implicit or an explicit representation of the underlying probability distribution.


To enable a more nuanced view on sequence modeling, this general task can 
be subdivided into three closely related sub-tasks, namely sequence encoding, 
synthesis and prediction. While sequence encoding is concerned with reducing 
a given sequence into a compact representation, e.g. a single vector, sequence 
synthesis and prediction aim at generating sequential data. Sequence synthe- 
sis, on the one hand, is concerned with the generation of sequences according 
to an underlying probability distribution, potentially conditioned on a specific 
input. Sequence prediction, on the other hand, combines both tasks by first
encoding a given input sequence (the observation) in order to then generate
a prediction for future data points of the observed sequence. As such,
sequence prediction can be regarded as a variant of conditional sequence syn- 
thesis, where the synthesis model is conditioned on another sequence. Most 
applications in the context of sequence modeling can be ascribed to at least 
one of these three more general inference tasks. A schematic of each task is 
given in Figure 2.1. 


Figure 2.1: Schematic of the sequence modeling sub-tasks sequence encoding, synthesis and
prediction. As an example, a sequence of 2D points is considered. In sequence encoding,
the sequence model takes in a given sequence and encodes it into a specific
representation, e.g. a vector $\mathbf{v}_{\mathrm{enc}}$. A sequence synthesis model optionally takes a specific
input, e.g. a vector $\mathbf{v}_{\mathrm{gen}}$, and generates a sequence. Sequence prediction combines
the two, as the sequence prediction model needs to encode a sequence it is given
(green), in order to synthesize a continuation of that sequence (blue).


With the prevalence of noise and uncertainties in real-world data, statistical
sequence models are employed for tackling any of the sequence modeling
tasks. For determining a statistical sequence model, it is assumed that each
sequence $\mathcal{S} = \{x_t\}_{t \in T}$ in a specific dataset is a realization of an unknown
stochastic process $\mathcal{G}_T = \{X_t\}_{t \in T}$ with index set $T$ and random variables $X_t$
following some probability distribution. Typically, $T$ either corresponds to
$\mathbb{N}_0$, $\mathbb{R}_0^+$ or some interval $[a, b]$, indicating a discrete-time, continuous-time
or finite (continuous-time) stochastic process, respectively. Commonly, these
statistical sequence models are either probabilistic sequence models or
stochastic process models. While the latter are themselves variants of stochastic
processes (e.g. Gaussian processes [Ras06]), probabilistic sequence models process
and generate probability distributions, thus providing a model for the
underlying stochastic process. The probabilistic sequence model itself can
thereby be either probabilistic or even deterministic.


Following this, this thesis focuses on learning-based probabilistic sequence 
models for (conditional) sequence synthesis, including sequence prediction. 
A sequence model then generates a distribution over the sequence to be syn- 
thesized instead of a single (maximum likelihood) sample. The remainder of 
this section provides an overview of the most important probabilistic models 
in this context. Given the prevalence of deep learning-based models among 
current state-of-the-art approaches, the overview is limited to such models 
only. For an overview of machine learning models beyond deep learning, e.g.
state space models, such as recursive Bayesian estimators [Sär13], or
autoregressive models, like the autoregressive moving-average model [Box15], the
reader is referred to comprehensive surveys on the topic, e.g. [Rud20b].
Although this survey focuses on a prediction task, most of the models mentioned
are more universally applicable.


2.1 Neural Sequence Processing 


To preface the overview, it is important to mention that deep learning-based 
probabilistic sequence models are commonly built around an underlying neu- 
ral sequence model, which is in charge of processing sequences at hand. While 
feed-forward networks (e.g. the Multilayer Perceptron [Mur91]) can be used 
in a setting of fixed length sequences or when applying a sliding window ap- 
proach, dedicated sequence models are usually preferred. Common choices 
for the underlying sequence model are Recurrent Neural Networks (abbrev.: 
RNN, [Rum86]) and their variants, Temporal Convolutional Networks (abbrev.: 
TCN, [Bai18]) and Transformer Networks (abbrev.: TF, [Vas17]).


2.1.1 Recurrent Neural Networks 


Recurrent Neural Networks are feed-forward networks with additional
recurrent connections along the time axis, enabling them to iteratively process
sequences and carry information about past inputs. As such, RNNs, and
especially their Long Short-Term Memory (abbrev.: LSTM, [Hoc97]) and Gated
Recurrent Unit (abbrev.: GRU, [Cho14]) variants, are widely used. While vanilla
RNNs are prone to gradient-related problems during training, especially
vanishing gradients [Pas13], the aforementioned variants incorporate gating
mechanisms to cope with such problems. From an operational point of view, RNNs
are usually built as either 1-to-1 or sequence-to-sequence (abbrev.: seq2seq,
sometimes also denoted as encoder-decoder, [Sut14]) RNNs. A 1-to-1 RNN
processes a given sequence one element at a time and generates
an output at each time step. This approach is generally applicable. In contrast,
seq2seq RNNs are more tailored towards conditional sequence synthesis,
where a given sequence is encoded first using an encoder RNN. The
resulting encoding is then decoded by another RNN, the decoder, in order
to generate an output sequence. Overall, the LSTM and GRU variants yield
comparable performance considering a range of sequence modeling tasks, with the
GRU performing slightly better in many cases [Chu14][Joz15]. However, when the
network is built as a sequence-to-sequence model, the LSTM outperforms the
GRU variant [Bri17].


As a final note, due to RNNs employing an autoregressive structure, i.e.
using their own output at time $t$ as input at time $t + 1$ during inference,
techniques for managing the network input during training should be discussed.
The most commonly used approach is teacher forcing
[Goo16]. Teacher forcing is a technique for training recurrent neural
networks that, at time $t$, uses the ground truth $x_t$ as input, rather than the
model's output from the previous time step, $y_{t-1}$. As such, the actual network
input signal is replaced with a teacher signal. This approach helps reach
convergence faster, at the cost of the network possibly never learning to cope
with its own imperfect output. A way to tackle this problem is to start the
training process using teacher forcing and then slowly transition into an
auto-conditioning scheme, where the actual network output is fed back in the
subsequent time step [Ben15].
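
The transition from teacher forcing to auto-conditioning can be illustrated with a short Python sketch. The one-step model `step_fn`, the helper names and the linear decay schedule are assumptions made for this example only; other decay schedules are equally possible.

```python
import numpy as np


def rollout(step_fn, ground_truth, teacher_prob, rng):
    """Run an autoregressive model over a sequence.

    step_fn: hypothetical one-step model mapping the input at time t to the
        output for time t + 1.
    teacher_prob: probability of feeding the ground truth instead of the
        model's own previous output (1.0 = pure teacher forcing,
        0.0 = pure auto-conditioning after the first step).
    """
    outputs = []
    prev_output = None
    for x_t in ground_truth:
        # The first step always uses the ground truth, as no output exists yet.
        use_teacher = prev_output is None or rng.random() < teacher_prob
        model_input = x_t if use_teacher else prev_output
        prev_output = step_fn(model_input)
        outputs.append(prev_output)
    return outputs


def teacher_prob_schedule(epoch, num_epochs):
    """Linearly decay the teacher forcing probability from 1 to 0 (assumed)."""
    return max(0.0, 1.0 - epoch / max(1, num_epochs - 1))


rng = np.random.default_rng(0)
toy_step = lambda x: x + 1           # stand-in for a trained network
sequence = [0, 10, 20]
forced = rollout(toy_step, sequence, teacher_prob=1.0, rng=rng)
free_running = rollout(toy_step, sequence, teacher_prob=0.0, rng=rng)
```

With `teacher_prob=1.0` every step sees the ground truth, whereas with `teacher_prob=0.0` the model consumes its own outputs after the first step, so errors can accumulate along the sequence.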


2.1.2 Temporal Convolutional Networks 


Temporal Convolutional Networks are a special variant of Convolutional Neu- 
ral Networks (abbrev.: CNN, [LeC95]) for sequential data, popularized by the 
WaveNet model in the context of audio synthesis [Oor16]. The model con- 
sists of dilated causal convolutions. While dilated convolutions [Hol90] are 
incorporated in order to capture long range dependencies, causal convolutions 
[Oor16] ensure that the temporal order of a given sequence is taken into
account. Advantages of the TCN over the RNN are its inherent parallelism
and a more stable training. As the TCN
processes multiple time steps at once instead of sequentially, convolutions
can be computed in parallel. The more stable training of the TCN can be attributed
to more stable gradients. On the downside, the TCN is less flexible in
processing sequences of variable length. Although it is possible to process
variable-length sequences by sliding the convolutional kernels, the memory of the
model is limited by the filter kernel's width and the dilation rate, whereas the
RNN may, in theory, establish dependencies up to the first sequence element.
Looking at a range of different sequence modeling tasks, the TCN is able to
outperform LSTM and GRU models [Bai18] or at least perform similarly [Bec18].
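
To make the notions of causality, dilation and limited memory concrete, the following NumPy sketch implements a single-channel dilated causal convolution and a receptive field formula for a stack of such layers. It is an illustrative simplification (one channel, no non-linearities, one convolution per layer), and the function names are hypothetical.

```python
import numpy as np


def causal_dilated_conv1d(x, kernel, dilation=1):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with zeros before t = 0."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    pad = (len(kernel) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])   # left padding only (causal)
    y = np.zeros_like(x)
    for k, w in enumerate(kernel):
        start = pad - k * dilation
        y += w * padded[start:start + len(x)]
    return y


def receptive_field(kernel_size, dilations):
    """Number of time steps (current and past) visible to the top layer,
    assuming one convolution of the given kernel size per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)


signal = np.array([1.0, 2.0, 3.0, 4.0])
output = causal_dilated_conv1d(signal, kernel=[1.0, 1.0], dilation=2)
memory = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
```

Each output element only depends on current and past inputs, and the receptive field grows with the sum of the dilation rates, which is the bounded memory referred to above.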


2.1.3 Transformer Networks 


Transformer Networks originated from the field of natural language modeling 
as a replacement for the commonly used RNN-based sequence-to-sequence 
models. Since their emergence, Transformers have also gained traction in other
application domains, most notably speech processing, where Transformers
consistently outperform RNN-based models [Kar19, Wan21]. Compared to RNNs,
which process sequences recursively, the Transformer aims to get rid of
recurrence and always considers the entire input sequence. As such, the most
important concepts Transformers are built around are positional encoding and
attention. While the positional encoding enriches input sequence elements
with information about their position within the given sequence, the atten- 
tion mechanism is in charge of determining which parts of the input sequence 
are of importance for the calculation of each element in the target sequence. 
Further, using an attention mechanism, enriched information is available for 
sequence generation, when compared to RNN-based sequence-to-sequence 
models, where the sequence decoder is only provided with an encoded rep- 
resentation of the input sequence. Besides that, Transformers are in general 
more stable during training, but also seem to be more prone to overfitting, 
which indicates problems with generalization [Zey19]. Additionally, in its
original formulation, the Transformer model is restricted to fixed-length
sequences. This restriction is tackled by the Transformer-XL extension [Dai19],
which re-introduces a notion of recurrence and extends the positional
encoding concept.
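
The two concepts can be sketched in a few lines of NumPy. The sinusoidal encoding and the scaled dot-product form used below follow the commonly cited formulation of the original Transformer; the single-head, unbatched setup and the function names are simplifications assumed for illustration.

```python
import numpy as np


def sinusoidal_positions(length, dim):
    """Sinusoidal positional encoding of shape (length, dim); dim must be even."""
    pos = np.arange(length)[:, None]
    freq = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    enc = np.zeros((length, dim))
    enc[:, 0::2] = np.sin(pos * freq)
    enc[:, 1::2] = np.cos(pos * freq)
    return enc


def attention(queries, keys, values):
    """Scaled dot-product attention; returns outputs and attention weights."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 8)) + sinusoidal_positions(5, 8)
# Self-attention: every element attends to the entire input sequence.
attended, attn_weights = attention(inputs, inputs, inputs)
```

Each row of `attn_weights` indicates how strongly the corresponding target element draws on every element of the input sequence.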
2.2 Probabilistic Sequence Models


This section provides an overview of deep learning-based probabilistic se- 
quence models commonly used as a basis for task-specific model adaptations. 
These models can roughly be put into three categories: Bayesian, regression- 
based and transformative approaches. For each category, the most relevant 
representatives are examined. 


2.2.1 Bayesian Approaches 


In deep learning-based Bayesian approaches, a deterministic neural network 
is turned into a probabilistic model by treating all its parameters as random 
variables. A prominent example for this class of approaches is given by Bayes- 
ian Neural Networks (abbrev.: BNN, [Bis95]). In these models, inference and 
parameter estimation are built around Bayes’ theorem. As such, the neural 
network outputs an arbitrary predictive probability distribution 


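$$p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\mathcal{D}}) = \int p(\mathbf{y} \mid \mathbf{x}, \Theta')\, p(\Theta' \mid \boldsymbol{\mathcal{D}})\, \mathrm{d}\Theta' \tag{2.1}$$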

by propagating the units’ output distributions through the network. For pa- 
rameter estimation, the posterior distribution 


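$$p(\Theta \mid \boldsymbol{\mathcal{D}}) = \frac{p(\boldsymbol{\mathcal{D}}_y \mid \boldsymbol{\mathcal{D}}_x, \Theta)\, p(\Theta)}{p(\boldsymbol{\mathcal{D}}_y \mid \boldsymbol{\mathcal{D}}_x)} \propto p(\boldsymbol{\mathcal{D}}_y \mid \boldsymbol{\mathcal{D}}_x, \Theta)\, p(\Theta) \tag{2.2}$$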
of the network parameters, given a set of data samples, needs to be determined.
Here, $\Theta$ denotes the model parameters, $\boldsymbol{\mathcal{D}}$ the training dataset split
into input data $\boldsymbol{\mathcal{D}}_x$ and target data $\boldsymbol{\mathcal{D}}_y$, and $\mathbf{x}$ and $\mathbf{y}$ denote specific input
and target vectors, respectively. Due to intractable probability distributions
arising from non-linear transformations, usually either Monte Carlo methods 
[Nea92] or approximate inference is required for both training and inference. 
Common techniques used for approximate inference include variational
inference (also known as Bayesian Backpropagation, [Blu15]), inference based
on expectation propagation [Her15] and Monte Carlo Dropout [Gal16]. In order
to extend BNNs for probabilistic sequence modeling, Bayesian Recurrent
Neural Networks (abbrev.: BRNN, [For17]) were introduced. For the BRNN, 
the variational Bayesian Backpropagation scheme is adapted for Backpropa- 
gation Through Time [Wer90]. 
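
As an illustration of why Monte Carlo methods enter at inference time, the integral in Eq. (2.1) can be approximated by averaging the likelihood over parameter samples. The sketch below uses a toy scalar linear model with Gaussian noise and a Gaussian stand-in for the posterior; both are assumptions chosen so that the exact result is known, and the function names are hypothetical.

```python
import numpy as np


def gaussian_pdf(y, mean, std):
    """Density of a univariate Gaussian evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def mc_predictive_density(y, x, theta_samples, noise_std):
    """Approximate p(y | x, D) by averaging p(y | x, theta_s) over samples
    theta_s drawn from (an approximation of) the posterior p(theta | D).

    Toy model (assumed): y = theta * x + Gaussian noise.
    """
    likelihoods = gaussian_pdf(y, theta_samples * x, noise_std)
    return likelihoods.mean()


rng = np.random.default_rng(0)
# Stand-in posterior over the single parameter: N(2.0, 0.1^2).
theta_samples = rng.normal(loc=2.0, scale=0.1, size=20000)
estimate = mc_predictive_density(y=4.0, x=2.0, theta_samples=theta_samples,
                                 noise_std=0.5)
# For this linear-Gaussian toy case the predictive is N(4.0, 0.5^2 + (2 * 0.1)^2).
exact = gaussian_pdf(4.0, 4.0, np.sqrt(0.5**2 + (2.0 * 0.1) ** 2))
```

For a non-linear network no closed form such as `exact` is available, so every prediction requires drawing and evaluating many parameter samples, which is the computational overhead discussed in this section.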


In summary, BNN-based approaches provide a fully probabilistic framework 
for sequence modeling. In addition to that, major advantages of such models 
are also given by their robustness to overfitting and the ability to provide in- 
formation about model uncertainty. As a drawback, such models are difficult 
to train, due to the requirement of approximate inference making the train- 
ing computationally more intensive and potentially less stable. Further, the 
need for approximate inference also yields a significant computational over- 
head when generating predictions. As a final note, considering the need for 
approximate inference, the Bayesian Perceptron [Hub20] is worth mention- 
ing. The Bayesian Perceptron is a specific novel probabilistic formulation of 
the Perceptron [Ros58], which provides closed-form parameter propagation 
and estimation, thus eliminating the need for Monte Carlo Methods and ap- 
proximate inference. However, a recurrent extension for sequence modeling 
building on this approach is not yet available. 


Despite not being probabilistic sequence models according to the definition 
given earlier in this section, Gaussian process models are worth mentioning in 
the context of deep learning-based Bayesian approaches. This is due to their
close relationship: the function computed by a randomly initialized deep neural
network can, in the limit of infinitely wide layers, be regarded as a function
drawn from a Gaussian process [Lee18]. Conversely, a GP corresponds to a neural
network with an infinite number of units in its hidden layer [Nea96, Wil97].


Gaussian Processes (GP) and Gaussian process regression [Ras06] provide a 
well-established model for probabilistic sequence modeling and especially 
prediction. Given a collection of sample points of a non-linear function
$f(\cdot): \mathbb{R}^n \to \mathbb{R}$, a mean function $m(\cdot)$ and a covariance function $k(\cdot, \cdot)$
(kernel), the GP yields a multivariate Gaussian prior probability distribution
over function space. The Gaussian distribution can be used to determine a
conditional predictive distribution over the next element in a sequence given
preceding observations. By embedding this 1-step prediction model into
a sequential Monte Carlo simulation, multiple time steps can be predicted 
[Ell09]. Deep Gaussian Processes (abbrev.: DGP, [Dam13]) extend on the GP 
framework in order to constitute non-Gaussian, and therefore more complex, 
models. A DGP is a hierarchy of multiple GPs using non-linear mappings 
between each layer of the hierarchy. However, the resulting probability 
densities are intractable and thus require an approximate solution, which can 
be achieved e.g. by variational approximation [Cam15]. A special case of 
the DGP, which implements an autoregressive structure comparable to that 
of an RNN, is given by Recurrent Gaussian Processes (abbrev.: RGP, [Mat16]). 
Here, the priors of latent variables in each hidden layer follow an autore- 
gressive structure. Following this, a recurrent variational approximation 
scheme, which uses a state space model-based approach, is introduced for 
inference. Despite their computationally intensive inference scheme, GP-based
approaches grant good control over generated sequences, by explicitly
modeling the kernel functions, thus controlling the prior over functions 
representable by the model through a regularization over the entire value 
range. This gives an advantage over most competing neural network-based 
approaches that generate sequences in a mostly unconstrained fashion. It 
should be noted, however, that GP-based approaches are rarely used in most 
application domains currently dominated by deep learning-based models. 
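

As a minimal sketch of the 1-step GP prediction described above, the following
computes the conditional predictive mean and variance for the next element of a
short scalar sequence. The zero mean function, the squared-exponential kernel,
the noise term and all names (`rbf_kernel`, `solve`, `gp_predict`) are
assumptions made for illustration and are not taken from the cited works.

```python
import math

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance function k(a, b) (assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def gp_predict(xs, ys, x_new, noise=1e-6):
    """Predictive mean and variance at x_new for a zero-mean GP."""
    gram = [[rbf_kernel(a, b) + (noise if i == j else 0.0)
             for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [rbf_kernel(a, x_new) for a in xs]
    alpha = solve(gram, ys)      # K^{-1} y
    beta = solve(gram, k_star)   # K^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf_kernel(x_new, x_new) - sum(k * b for k, b in zip(k_star, beta))
    return mean, var

# Hypothetical observed sequence at time steps 0, 1, 2; predict step 3.
times, values = [0.0, 1.0, 2.0], [0.0, 0.8, 0.9]
next_mean, next_var = gp_predict(times, values, 3.0)
```

Predicting several steps ahead with this 1-step model would require feeding
sampled predictions back in as observations, which is where the sequential
Monte Carlo simulation mentioned above comes in.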


2.2.2 Regression-based Approaches 


One of the main areas of application for neural networks is given by regression 
tasks, due to their ability to learn arbitrary mappings from a given domain into 
a targeted co-domain. As such, neural networks can be used for probabilis- 
tic modeling when treating the task of uncertainty estimation as a regression 
problem. The neural network is then in charge of learning a mapping from 
a given set of samples onto the parameters of a probability distribution es- 
timating the generating distribution. Following this, the negative data log 
likelihood is optimized during training: 


$$\mathcal{L} = -\log p_\theta(x). \tag{2.3}$$


The most widely used regression-based neural network for probabilistic
modeling is given by Mixture Density Networks (abbrev.: MDN, [Bis94, Bis06]),
which map the output of their last layer onto the parameters of a mixture 
distribution. The prevalent choice for the mixture distribution is given by 
the Gaussian distribution, although Laplace distributions have also been used 
with Mixture Density Networks [Bra19]. For building a probabilistic sequence 
model using MDNs, a common choice is the Recurrent Mixture Density Net- 
work (abbrev.: R-MDN) model as proposed in [Gra13]. Here, an MDN is 
stacked on top of an LSTM network. The recurrent structure is then used 
for encoding the observed sequence as well as for generating predictions. 
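

To make the training objective in Equation (2.3) concrete for a mixture
distribution, the following sketch evaluates the negative log-likelihood of a
scalar target under a one-dimensional Gaussian mixture. The mixture parameters
stand in for what an MDN would output; the function names and the numbers are
illustrative assumptions.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def gmm_nll(x, weights, means, stds):
    """Negative log-likelihood -log p(x) under a 1-D Gaussian mixture."""
    density = sum(w * gaussian_pdf(x, m, s)
                  for w, m, s in zip(weights, means, stds))
    return -math.log(density)

# Hypothetical network output: two equally weighted modes at -2 and +2.
weights, means, stds = [0.5, 0.5], [-2.0, 2.0], [0.5, 0.5]
loss_on_mode = gmm_nll(2.0, weights, means, stds)   # target lies on a mode
loss_between = gmm_nll(0.0, weights, means, stds)   # target lies between modes
```

A target on one of the modes yields a lower loss than a target between the
modes, which is what allows the mixture to represent several modes without
blurring them into their average.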


Compared to Bayesian approaches, such regression-based approaches use a
deterministic model and are therefore generally much simpler in terms of
inference and computational cost, while still generating probabilistic output.
On the downside, these approaches only give a point estimate for the parameters
of a preset target probability distribution. This limits the modeling
capabilities of the approach and does not allow model uncertainty to be
assessed directly. Two drawbacks are specific to R-MDNs. First, generating
multi-modal probabilistic predictions generally requires expensive Monte Carlo
simulation [Hug18]. Second, MDNs are prone to mode collapse [Mak19], where the
model collapses into generating only slight variations of a single mode.


A more detailed introduction to MDNs is given in Section 4.1. 


2.2.3 Transformative approaches 


Transformative approaches transform samples of a simple probability distri- 
bution into a sample-based representation of a more complex probability dis- 
tribution. As such, transformative approaches combine deterministic neural 
networks with stochastic inputs in order to define a generative model. The 
most important models in this category are given by Variational Autoencoders 
and Generative Adversarial Networks. 


Variational Autoencoders (abbrev.: VAE, [Kin14]) are a class of deep
generative models with latent variables. In latent variable models, the unknown
generating distribution $p(x)$ is modeled in terms of latent variables
$z \sim p_\theta(z)$ with prior distribution $p_\theta(z)$. According to Bayes'
theorem, $p(x)$ and $p_\theta(z)$ are linked by the mappings $p_\theta(x|z)$
(likelihood) and $p_\theta(z|x)$ (posterior), which are computed explicitly.
Following this, the generally intractable posterior needs to be approximated.
For this approximation, the VAE follows a variational inference approach,
approximating $p_\theta(z|x)$ with the variational posterior $q_\phi(z|x)$.
Putting each part of the latent variable model together, in VAEs
$q_\phi(z|x)$ and $p_\theta(x|z)$ are defined in terms of deterministic neural
networks, denoted as the recognition model and the generative model. The
networks are arranged similar to autoencoders [Hin94], with the latent space as
the bottleneck. As a consequence, the generative process of the VAE works by
transforming a set of latent variable samples $z_i \sim p_\theta(z)$ drawn from
the prior distribution $p_\theta(z)$ using the generative model
$p_\theta(x|z)$. The prior distribution $p_\theta(z)$ is commonly defined as
$\mathcal{N}(0, I)$. Training the VAE is made possible by using the variational
lower bound (also known as the evidence lower bound)


$$\log p(x) \geq -\mathrm{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z)\right) + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] \tag{2.4}$$


in conjunction with the reparameterization trick [Kin14], which enables joint
gradient-based training of the entire network. It should be noted that
Normalizing Flows [Rez15] can be used in a VAE in order to replace the learned
approximate posterior $q_\phi(z|x)$. Normalizing Flows are a chain of invertible
mappings that can be used to transform samples of one probability distribution
into another. In the context of VAEs, Normalizing Flows provide a framework
for building a more flexible and complex variational approximation of
the posterior $q_\phi(z|x)$ through an iterative procedure [Kin16, Hua18].
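

The lower bound in Equation (2.4) and the reparameterization trick can be
illustrated with a scalar toy model. The sketch below assumes a standard normal
prior, a Gaussian variational posterior given by a mean and a log-variance, and
a hypothetical linear decoder with unit observation noise; none of these
choices are prescribed by the text above.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, exp(log_var)) || N(0, 1)) for a scalar latent."""
    return 0.5 * (mu * mu + math.exp(log_var) - 1.0 - log_var)

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1)."""
    return mu + math.exp(0.5 * log_var) * rng.gauss(0.0, 1.0)

def log_likelihood(x, z, a=1.0, b=0.0):
    """log p(x|z) for the assumed toy decoder x ~ N(a * z + b, 1)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - (a * z + b)) ** 2

def elbo_estimate(x, mu, log_var, n_samples=1000, seed=0):
    """Monte Carlo estimate of the right-hand side of Equation (2.4)."""
    rng = random.Random(seed)
    reconstruction = sum(
        log_likelihood(x, reparameterize(mu, log_var, rng))
        for _ in range(n_samples)
    ) / n_samples
    return reconstruction - kl_to_standard_normal(mu, log_var)

# With the posterior set to the prior, the bound is loose; moving it towards
# the true posterior N(x / 2, 1 / 2) of this toy model tightens the bound.
loose = elbo_estimate(1.0, mu=0.0, log_var=0.0)
tight = elbo_estimate(1.0, mu=0.5, log_var=math.log(0.5))
```

Because the random draw is isolated in `eps`, the sample `z` is a deterministic
function of `mu` and `log_var`, which is what makes gradient-based training of
the recognition model possible.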


In order to extend on the concept of VAEs for sequence modeling, two 
approaches have emerged: the Seq2seq Conditional VAE and the Variational 
Recurrent Neural Network. Seq2seq Conditional VAEs [Bow16] build on the 
concept of conditional VAEs (abbrev.: CVAE, [Soh15]), which employ a 
conditional generating distribution $p(x|v)$ conditioned on some input $v$.
This results in the conditional latent prior $p_\theta(z|v)$ and conditional
mappings $p_\theta(x|z,v)$ and $q_\phi(z|x,v)$. Following this, for sequence
modeling, each sample $x \sim p_\theta(x|z,v)$ represents a full sequence.
Further, in the case of sequence prediction, a given observed sequence needs to
be encoded into $v$, in order to condition the CVAE's generative model on the
given sequence. Hence, in seq2seq CVAEs, a CVAE is combined with a seq2seq RNN,
where the RNN encoder is used to encode the observation into $v$ and the RNN
decoder represents $p_\theta(x|z,v)$, generating the target sequence from $v$
and $z$. Following this, the seq2seq CVAE generates a distribution over
sequences of a specific length. In contrast to the composite approach of
seq2seq CVAEs, the Variational Recurrent Neural Network (abbrev.: VRNN,
[Chu15]) explicitly models the dependencies between latent variables of
subsequent time steps. Following this, the VRNN embeds an RNN into a CVAE,
which is at time $t$ conditioned on the RNN's previous hidden state $h_{t-1}$.
For sequence synthesis, the VRNN then operates as a 1-to-1 model, generating a
sequence of probability distributions rather than a probability distribution
over sequences.


To summarize, VAE-based probabilistic sequence models provide comparable
modeling capabilities to Bayesian approaches, while eliminating expensive
approximations during inference, due to the transformative approach. On
the downside, because of imperfect reconstructions due to the injected noise
when generating samples, training results can become less consistent.


Generative Adversarial Networks (abbrev.: GAN, [Goo14]) are another type of
generative model, learning an implicit¹ model of the unknown generating
distribution $p(x)$. While the generative model component in GANs is very
similar to that of VAEs in that samples of a simple distribution are
transformed into a sample-based representation of a more complex distribution
using a deterministic neural network, the network structure and the approach to
estimating the parameters of the generative model are vastly different. In
order to bypass the need to solve or approximate an intractable posterior
distribution, GAN training is framed as a supervised learning problem, using a
combination of two neural networks: the generative model itself (denoted as
generator) and a discriminator. Both models are jointly trained playing a
zero-sum game, where the generator tries to generate samples from the unknown
data distribution $p(x)$, which the discriminator is incapable of classifying
as real or fake. As such, the generator $G$ and the discriminator $D$ play the
two-player minimax game


$$\mathcal{L} = \min_G \max_D \underbrace{\left( \mathbb{E}_{x \sim p(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right] \right)}_{V(G,D)} \tag{2.5}$$


with value function $V(G, D)$. The generator's stochastic input distribution
$p(z)$ is commonly defined as the multivariate standard Gaussian $\mathcal{N}(0, I)$.


¹ Implicit density models do not compute $p(x)$, but allow sampling from the
underlying distribution using the model.
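

A sample-based estimate of the value function $V(G, D)$ in Equation (2.5) can
be sketched as follows. The scalar samples and the two discriminators are
hypothetical and only serve to show how the two expectation terms behave.

```python
import math

def value_function(real_samples, fake_samples, discriminator):
    """Sample-based estimate of V(G, D) from Equation (2.5)."""
    real_term = sum(math.log(discriminator(x))
                    for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - discriminator(x))
                    for x in fake_samples) / len(fake_samples)
    return real_term + fake_term

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

real = [1.8, 2.1, 2.4]      # hypothetical data samples x ~ p(x)
fake = [-2.2, -1.9, -2.0]   # hypothetical generator outputs G(z)

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
confused = value_function(real, fake, lambda x: 0.5)
# A discriminator that separates the two sample sets well.
separating = value_function(real, fake, lambda x: sigmoid(3.0 * x))
```

The discriminator maximizes this value, while the generator minimizes it by
moving its samples to where the discriminator can no longer separate them; for
a constant output of 0.5 the value equals $-\log 4$.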


Similar to VAE-based approaches, the GAN can be extended for probabilistic 
sequence modeling by combining a conditional variant of the GAN [Mir14] 
with seq2seq RNNs [Yu17, Gup18]. Following this, the conditional genera- 
tor G(z,v) is combined with a seq2seq RNN and the conditional discriminator 
D(x,v) is combined with an RNN sequence encoder. With the similarities to 
the VAE in the generative model, GANs provide similar benefits without the 
need for variational inference during training. Despite this, GANs tend to be 
hard to train because of vanishing gradient problems and the GAN's
proneness to mode collapse. While these problems are addressed by variations of
the GAN building on the Wasserstein distance [Arj17, Gul17] or by incorpo- 
rating multiple versions of the discriminator into the generator’s loss function 
[Met17], balancing parameter updates between the generator and the discrim- 
inator still poses a challenging problem, as the discriminator converges faster 
than the generator on many occasions [Ham20]. 


2.3 Placement of this Thesis 


Looking at the overview of commonly used (probabilistic) sequence models 
for handling sequential data under uncertainty, a research gap in regression- 
based approaches for multi-modal probabilistic inference is revealed. Follow- 
ing this, this thesis aims to provide a fully regression-based probabilistic se- 
quence model with respect to model training and inference using the model. 


The targeted placement of this thesis among other sequence models is given 
in Table 2.1. 


Table 2.1: Targeted placement of this thesis among other (probabilistic) sequence models. 


Approach 
Model Regression (Approximate) Bayesian 
Training Inference Training Inference 

BNN x x 

DGP X x 

VAE X x 

GAN X x 
R-MDN x x 

This thesis x x 


* In a multi-modal setting 


Throughout this chapter, a probabilistic sequence model for representing
stochastic processes is formulated, which aims to avoid the necessity of
Monte Carlo approaches. To achieve this, first and foremost a sequence
model for fixed-length sequences will be introduced in Section 3.1, covering
the most application-relevant case. This covers continuous-time as well as
discrete-time stochastic processes with bounded index set in a unimodal
and a multi-modal setting. The model is then extended for the representation 
of infinite stochastic processes in Section 3.2 by lifting some conceptual 
limitations present in the former variant of the model. 


The general idea behind the proposed probabilistic sequence model is to cir- 
cumvent Monte Carlo sampling. Therefore, the model needs to represent full 
sequences instead of iteratively building them. Following this, a probabilistic 
extension to a certain type of parametric curves, Bézier curves in this case, 
is derived, granting a suitable representation of sequential data in arbitrary 
dimensions. The probabilistic sequence model is then built on these proba- 
bilistic Bézier curves. 


3.1 The N-Curve Model 


This section proposes a Bézier curve defined by stochastic control points
capable of describing a continuous-time stochastic process
$\mathcal{G}_T = \{X_t\}_{t \in T}$ on a closed range with Gaussian random
variables $X_t \sim \mathcal{N}(\mu_t, \Sigma_t)$ and index set $T = [0, 1]$.
This concept is further extended for modeling random variables following a
Gaussian mixture distribution.


Starting with plain Bézier curves, a Bézier curve (e.g. [Pra02, Far02]) of
degree $N_{\mathrm{deg}}$


$$B_{\mathcal{P}}(t) = \sum_{i=0}^{N_{\mathrm{deg}}} b_{i,N_{\mathrm{deg}}}(t)\, p_i \tag{3.1}$$


is a blended curve constructed as a linear combination of
$N_{\mathrm{deg}} + 1$ d-dimensional control points
$\mathcal{P} = \{p_0, p_1, \ldots, p_{N_{\mathrm{deg}}}\}$ using the Bernstein
basis polynomials [Lor13]


$$b_{i,N}(t) = \binom{N}{i} (1 - t)^{N-i}\, t^i \tag{3.2}$$


as blending functions. The Bernstein basis polynomials are non-negative and
satisfy $\sum_i b_{i,N}(t) = 1$. Each curve point $x_t = B_{\mathcal{P}}(t)$ is
determined by the curve's positional parameter $t \in [0, 1]$, where $t = 0$
corresponds to $p_0$ and $t = 1$ to $p_{N_{\mathrm{deg}}}$, respectively. The
positional parameter can also be interpreted as a time parameter when looking
at the curve points as a temporally ordered sequence of points. An example for
a 2-dimensional Bézier curve of degree $N_{\mathrm{deg}} = 4$ for
$t \leq 0.88$ with corresponding Bernstein basis polynomials is depicted in
Figures 3.1a and 3.1b, respectively.


(a) Exemplary 2-dimensional Bézier curve of degree $N_{\mathrm{deg}} = 4$ for $t \leq 0.88$.
(b) Bernstein basis polynomials $b_{i,4}(t)$ for 4th degree blending.


Figure 3.1: Illustrating the connection between the Bernstein basis polynomials and Bézier curve
construction. The Bernstein polynomial values control the weighting of control
points for calculating curve points. The colors of the control points in figure (a) are
associated with the weight curve of the same color in figure (b). Weights of control
points for each curve point are dependent on the positional parameter $t$. Figure (a)
shows a curve constructed up to $t = 0.88$, the remainder is indicated as a dashed
line. Corresponding weights for $t = 0.88$ are indicated by circular markers in
figure (b).
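

Equations (3.1) and (3.2) translate directly into code. The following sketch
evaluates a Bézier curve of arbitrary degree and dimension; the control point
coordinates are made up for illustration and do not correspond to Figure 3.1.

```python
from math import comb

def bernstein(i, n, t):
    """Bernstein basis polynomial b_{i,n}(t) = C(n, i) (1 - t)^(n - i) t^i."""
    return comb(n, i) * (1.0 - t) ** (n - i) * t ** i

def bezier_point(control_points, t):
    """Curve point as the Bernstein-weighted sum of the control points."""
    n = len(control_points) - 1          # curve degree N_deg
    dim = len(control_points[0])         # control point dimension d
    return [sum(bernstein(i, n, t) * p[k] for i, p in enumerate(control_points))
            for k in range(dim)]

# Hypothetical 2-dimensional control points of a degree-4 curve.
control_points = [(0.0, 0.0), (1.0, 3.0), (3.0, 4.0), (5.0, 3.0), (6.0, 0.0)]
start = bezier_point(control_points, 0.0)    # coincides with p_0
middle = bezier_point(control_points, 0.5)
end = bezier_point(control_points, 1.0)      # coincides with p_4
```

Only the first and the last control point are interpolated; the intermediate
control points pull the curve towards them without lying on it in general.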


Considering the objective of modeling a stochastic process, the curve points 
along this parametric curve have to be stochastic. A schematic of such a prob- 
abilistic Bézier curve is illustrated in Figure 3.2. Here, Figure 3.2a illustrates 
a discrete 2-dimensional Bézier curve as the starting point. Figure 3.2b indi- 
cates uncertainty associated with each curve point as a shaded region around 
the curve. It has to be noted that this presentation of uncertainty is for
illustration purposes only. Uncertainties of multiple time steps are overlaid
while only considering uncertainty orthogonal to the actual curve. Thus, it 
does not reflect the real probability distribution when integrating over the 
positional parameter. 


(a) Exemplary 2-dimensional Bézier curve of degree $N_{\mathrm{deg}} = 4$.
(b) Schematic extension of a Bézier curve incorporating curve point uncertainty.


Figure 3.2: Illustration of the starting point ((a): discrete Bézier curve) and goal ((b): probabilistic
Bézier curve) of this section for a Bézier curve of degree $N_{\mathrm{deg}} = 4$. Uncertainty
associated with each curve point is indicated by a shaded region around the curve,
representing $\sigma$ and $2\sigma$ regions. Note: The presentation of uncertainty is for
illustration purposes only and does not reflect a real probability distribution integrated
over the curve's positional parameter.


In order to define a probabilistic extension for Bézier curves, such that gener- 
ated curve points are stochastic and follow some probability distribution, it is 
necessary for the control points to be stochastic as well. This is due to every 
curve point being a linear combination of the curve’s control points. Thus, 
an important question is given by the choice of a suitable probability distri- 
bution for the control points. A common choice is given by the Gaussian 
distribution, which is commonly used in machine learning and statistics due 
to its mathematical properties. On the one hand, the popularity of the
Gaussian distribution can be explained through the central limit theorem
[And10], which states that the suitably normalized sum of independent and
identically distributed random variables with finite variance converges in
distribution towards a Gaussian distribution. Further, among all real-valued
distributions with a
given mean and variance, the Gaussian distribution is the distribution of max- 
imum entropy [Con04]. On the other hand, the most notable mathematical 
property for defining a probabilistic Bézier curve is the fact that the linear 
combination of Gaussian random variables is again Gaussian. 


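Following this, for describing a stochastic process $\mathcal{G}_T$ in terms of
a parametric curve, each curve point should follow a Gaussian distribution.
Thus, a Gaussian Bézier curve $\psi$, denoted as N-Curve, is proposed. The
N-Curve extends on Equation (3.1) and defines the control points
$\mathcal{P}_{\mathcal{N}} = \{P_0, P_1, \ldots, P_{N_{\mathrm{deg}}}\}$ to
follow a Gaussian distribution with
$P_i \sim \mathcal{N}(\mu_i, \Sigma_i)\ \forall P_i \in \mathcal{P}_{\mathcal{N}}$.
The set of mean vectors is denoted as
$\mu_{\mathcal{P}} = \{\mu_0, \mu_1, \ldots, \mu_{N_{\mathrm{deg}}}\}$ and the
set of covariance matrices as
$\Sigma_{\mathcal{P}} = \{\Sigma_0, \Sigma_1, \ldots, \Sigma_{N_{\mathrm{deg}}}\}$,
respectively. Thus, the N-Curve is defined by a tuple
$\psi = \mathcal{P}_{\mathcal{N}} = (\mu_{\mathcal{P}}, \Sigma_{\mathcal{P}})$.
As curve points are defined through a linear combination of the control points,
the stochasticity is inherited from the control points to the curve points
$\{X_t\}_{t \in [0,1]}$. This is due to the fact that for $AX + BY$ with
independent $X \sim \mathcal{N}(\mu_X, \Sigma_X)$ and
$Y \sim \mathcal{N}(\mu_Y, \Sigma_Y)$ follows¹


$$AX + BY \sim \mathcal{N}\left(A\mu_X + B\mu_Y,\; A\Sigma_X A^\top + B\Sigma_Y B^\top\right).$$


Thus, the curve function


$$B_{\mathcal{N}}(t, \psi) = \left(\mu^\psi(t), \Sigma^\psi(t)\right) \tag{3.3}$$

with

$$\mu^\psi(t) = \sum_{i=0}^{N_{\mathrm{deg}}} b_{i,N_{\mathrm{deg}}}(t)\, \mu_i \tag{3.4}$$

and

$$\Sigma^\psi(t) = \sum_{i=0}^{N_{\mathrm{deg}}} \left(b_{i,N_{\mathrm{deg}}}(t)\right)^2 \Sigma_i \tag{3.5}$$


defines the parameters of a (multivariate) Gaussian probability distribution for
each $t \in [0, 1]$. Each d-dimensional curve point $X_t$ then follows the
respective Gaussian distribution $\mathcal{N}(\mu^\psi(t), \Sigma^\psi(t))$ at
index $t$. The Gaussian probability density at index $t$ is given by


$$\begin{aligned}
p_t^\psi(x) &= p\left(x \mid \mu^\psi(t), \Sigma^\psi(t)\right) \\
&= \mathcal{N}\left(x \mid \mu^\psi(t), \Sigma^\psi(t)\right) \\
&= \frac{1}{\sqrt{(2\pi)^d \left|\Sigma^\psi(t)\right|}} \exp\left\{ -\frac{1}{2} \left(x - \mu^\psi(t)\right)^\top \left(\Sigma^\psi(t)\right)^{-1} \left(x - \mu^\psi(t)\right) \right\}.
\end{aligned} \tag{3.6}$$


¹ Following the definition as provided in The Matrix Cookbook [Pet08].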
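

The curve function in Equations (3.3) to (3.5) and the density in Equation
(3.6) can be sketched for the 2-dimensional case as follows. The control point
parameters are hypothetical, and the density function is restricted to 2x2
covariance matrices so that the inverse can be written out explicitly.

```python
from math import comb, exp, pi, sqrt

def bernstein(i, n, t):
    return comb(n, i) * (1.0 - t) ** (n - i) * t ** i

def n_curve_point(means, covs, t):
    """Return (mu(t), Sigma(t)) following Equations (3.4) and (3.5)."""
    n = len(means) - 1
    d = len(means[0])
    w = [bernstein(i, n, t) for i in range(n + 1)]
    mu = [sum(w[i] * means[i][k] for i in range(n + 1)) for k in range(d)]
    # Covariance matrices are blended with the squared Bernstein weights.
    sigma = [[sum(w[i] ** 2 * covs[i][r][c] for i in range(n + 1))
              for c in range(d)] for r in range(d)]
    return mu, sigma

def gaussian_density_2d(x, mu, sigma):
    """Equation (3.6) for d = 2, using the explicit inverse of a 2x2 matrix."""
    (a, b), (c, e) = sigma
    det = a * e - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    quad = (e * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return exp(-0.5 * quad) / (2.0 * pi * sqrt(det))

# Hypothetical N-Curve with 3 Gaussian control points.
identity = [[1.0, 0.0], [0.0, 1.0]]
means = [(0.0, 0.0), (2.0, 4.0), (4.0, 0.0)]
covs = [identity, [[2.0, 1.0], [1.0, 2.0]], identity]
mu_half, sigma_half = n_curve_point(means, covs, 0.5)
peak_density = gaussian_density_2d(mu_half, mu_half, sigma_half)
```

No sampling is involved: the distribution of each curve point is obtained in
closed form from the control point parameters.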


An example for a Gaussian curve point constructed from Gaussian control
points for $t = 0.5$ is depicted in Figure 3.3. The intermediate control point
$P_1$ influences $X_t$ the most, which leads $X_t$ to adopt a skewed covariance
ellipse. Due to the covariance matrices being interpolated, the other control
points, $P_0$ and $P_2$, contribute to $X_t$ by making the covariance ellipse
more spherical.


Figure 3.3: Example for a Gaussian curve point $X_t$ on an N-Curve with 3 Gaussian control
points $P_i \sim \mathcal{N}(\mu_i, \Sigma_i)$, $i \in \{0, 1, 2\}$, for $t = 0.5$. The covariance
matrix of $X_t$'s Gaussian distribution is a combination of the control point covariance matrices.


So far, a stochastic process model for a continuous index set T = [0,1] 
has been defined. In contrast to this, many real-world applications require 
discrete-time stochastic processes handling sequential data. For handling 
such use-cases, the N-Curve model can be used to model Gaussian distributions
at $N$ discrete points in time with $B_{\mathcal{N}}(t, \psi)$ using $N$
equally distributed values for $t$, yielding a discrete subset


$$T_N = \left\{ \frac{v}{N-1} \,\middle|\, v \in \{0, \ldots, N-1\} \right\} = \{t_1, \ldots, t_N\} \tag{3.7}$$


of the index set $T$. Thus, each process index (curve parameter) $t_i \in T_N$
corresponds to its respective sequence index at time $i \in \{1, \ldots, N\}$.
It has to be noted that using equidistant values for $t$ does not necessarily
result in equidistant curve points. The distribution of the curve points along
the curve depends on the positions of the control points. This is illustrated
in Figure 3.4.


(a) Example for equidistant curve points ($\Delta_1 = \Delta_2 = \Delta_3 = \Delta_4 = \Delta_5 \approx 2.3$).
(b) Example for decreasing distance between curve points ($\Delta_1 \approx 3.2$, $\Delta_2 \approx 2.7$,
$\Delta_3 \approx 2.3$, $\Delta_4 \approx 1.8$ and $\Delta_5 \approx 1.4$).


Figure 3.4: Illustration of the impact of control point positioning on the distribution of curve
points along the curve. Figures (a) and (b) show two exemplary Bézier curves
with one shifted control point and identical shape. Note that the shape is not
impacted by shifting $p_1$, as it lies on the straight line between $p_0$ and $p_2$. Curve
points (black circular markers) are calculated using the same discrete index set
$T_{N=6} = \{0, 0.2, 0.4, 0.6, 0.8, 1\}$ of equally distributed values for $t$.
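

The effect described above can be reproduced with a few lines of code. The
sketch builds the index set from Equation (3.7) and measures the distances
between consecutive curve points for two curves with collinear control points.
For brevity, the curves are 1-dimensional (positions along the straight line),
and the control point values are assumptions that do not match Figure 3.4.

```python
from math import comb

def index_set(n):
    """T_N = {v / (N - 1) | v in {0, ..., N - 1}} from Equation (3.7)."""
    return [v / (n - 1) for v in range(n)]

def bezier_1d(control_points, t):
    """Scalar Bézier curve point for scalar control points."""
    n = len(control_points) - 1
    return sum(comb(n, i) * (1.0 - t) ** (n - i) * t ** i * p
               for i, p in enumerate(control_points))

def gaps(control_points, n_steps):
    """Distances between consecutive curve points on the index set T_N."""
    points = [bezier_1d(control_points, t) for t in index_set(n_steps)]
    return [b - a for a, b in zip(points, points[1:])]

even_gaps = gaps([0.0, 5.0, 10.0], 6)      # middle control point centered
shifted_gaps = gaps([0.0, 8.0, 10.0], 6)   # middle control point shifted
```

With the middle control point centered, all gaps are equal; shifting it towards
the end point makes the gaps shrink along the curve, although the same
equidistant values for $t$ are used in both cases.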


Finally, the Gaussian random variable $X_{t_i}$ at time $i$ is given by

$$X_{t_i} \sim \mathcal{N}\left(B_{\mathcal{N}}(t_i, \psi)\right) = \mathcal{N}\left(\mu^\psi(t_i), \Sigma^\psi(t_i)\right) \tag{3.8}$$


with $P_0 = X_{t_1}$ and $P_{N_{\mathrm{deg}}} = X_{t_N}$ as exact start and
end conditions. Figure 3.5 depicts a 2-dimensional example for an N-Curve with
5 control points. The mean curve and control points with respective covariance
ellipses are shown in Figure 3.5a. Gaussian random variables $X_t$ along the
N-Curve given different values for $t$ are illustrated in Figure 3.5b. The
influence of the most dominant control point for each curve point $X_t$ is
clearly visible in the covariances, adapting towards respective control point
covariances. Note that the parametric curve interpolates the mean vectors of
all Gaussian distributions through time.


(a) Gaussian control points and continuous mean curve of a 2-dimensional N-Curve.
(b) Gaussian random variables along the N-Curve for $t \in \{0, 0.2, 0.4, 0.6, 0.8, 1\}$.


Figure 3.5: Example for modeling a finite discrete-time stochastic process using an N-Curve.
The stochastic process consists of random variables corresponding to a discrete
subset of $T = [0, 1]$.


As a final aspect to consider, the N-Curve model can easily be extended for 
modeling multi-modal stochastic processes. While Gaussian probability dis- 
tributions are a sufficient representation for unimodal sequence data, many 
real-world problems require a multi-modal representation. For this, a common
approach is to use a Gaussian mixture probability distribution


$$\Gamma\left(\{\pi_k\}_{k \in \{1, \ldots, K\}}, \{\mathcal{N}(\mu_k, \Sigma_k)\}_{k \in \{1, \ldots, K\}}\right) \tag{3.9}$$

defined by $K$ weighted Gaussian components and probability density function


$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \quad \text{with } \sum_{k=1}^{K} \pi_k = 1 \text{ and } \pi_k \geq 0. \tag{3.10}$$


In the same way, the concept of N-Curves can be extended to a mixture
$\Psi = (\pi, \{\psi_1, \ldots, \psi_K\})$ of $K$ weighted N-Curves with
normalized weights $\pi = \{\pi_1, \ldots, \pi_K\}$. The stochastic curve
points $X_t$ at index $t \in T$ then follow a Gaussian mixture distribution


$$X_t \sim \Gamma\left(\pi, \{B_{\mathcal{N}}(t, \psi_k)\}_{k \in \{1, \ldots, K\}}\right). \tag{3.11}$$

Accordingly, the probability density at $t \in T$ induced by $\Psi$ is given by


$$p_t^\Psi(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}\left(x \mid \mu^{\psi_k}(t), \Sigma^{\psi_k}(t)\right), \tag{3.12}$$


with $\mu^{\psi_k}(t)$ and $\Sigma^{\psi_k}(t)$ given by the k'th N-Curve, i.e.
$(\mu^{\psi_k}(t), \Sigma^{\psi_k}(t)) = B_{\mathcal{N}}(t, \psi_k)$. Following
this, each stochastic curve point can be multi-modal and each mode of the
modeled stochastic process follows a separate N-Curve. As such, the N-Curve
mixture provides the evolution of $X_t$ along multiple paths through time. An
example for an N-Curve mixture is depicted in Figure 3.6.
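

As a sketch of the mixture density in Equation (3.12), the following evaluates
a mixture of two 1-dimensional N-Curves at a given index. The restriction to
scalar control points, the weights and all parameter values are assumptions
chosen to keep the example short.

```python
from math import comb, exp, pi, sqrt

def bernstein(i, n, t):
    return comb(n, i) * (1.0 - t) ** (n - i) * t ** i

def n_curve_point_1d(means, variances, t):
    """Mean and variance of the curve point at index t for a scalar N-Curve."""
    n = len(means) - 1
    mu = sum(bernstein(i, n, t) * m for i, m in enumerate(means))
    var = sum(bernstein(i, n, t) ** 2 * v for i, v in enumerate(variances))
    return mu, var

def mixture_density(x, t, weights, curves):
    """Density of the N-Curve mixture at index t, following Equation (3.12)."""
    total = 0.0
    for w, (means, variances) in zip(weights, curves):
        mu, var = n_curve_point_1d(means, variances, t)
        total += w * exp(-0.5 * (x - mu) ** 2 / var) / sqrt(2.0 * pi * var)
    return total

# Hypothetical mixture: one curve bends upwards, the other one downwards.
weights = [0.6, 0.4]
curves = [([0.0, 2.0, 4.0], [0.2, 0.2, 0.2]),
          ([0.0, -2.0, -4.0], [0.2, 0.2, 0.2])]
upper_mode = mixture_density(2.0, 0.5, weights, curves)
lower_mode = mixture_density(-2.0, 0.5, weights, curves)
between = mixture_density(0.0, 0.5, weights, curves)
```

At $t = 0.5$ the two component curves have separated, so the density has two
distinct modes whose heights reflect the mixture weights, with almost no
probability mass between them.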


Figure 3.6: Example for a multi-modal stochastic curve point $X_t$ for $t = 0.5$ given by an
N-Curve mixture consisting of 2 N-Curves $\psi_1$ and $\psi_2$. Both $\psi_1$ and $\psi_2$ are
defined by 3 Gaussian control points. The curve point $X_t$ follows a 2-component Gaussian
mixture distribution.


3.1.1 Rationale behind choosing Bézier curves


Among related parametric curves with alternative formulations (e.g. 
Pythagorean-hodograph curves [Far08]) or basis polynomials (e.g. La- 
grange bases [War79, Jef88] or the power basis [Sto89]), Bézier curves are 
the most widely used type of blended curves in various fields, especially in 
computer-aided design (e.g. [Fit14]), animation (e.g. [Haa18, Izd20]) and
path planning (e.g. [Jol09, Tha19]). Besides their popularity, Bézier curves 
offer some valuable properties for the N-Curve model. First and foremost, 
Bézier curves are numerically stable, as well as easy to calculate, control and 
manipulate. Every control point contributes to every curve point, which 
makes curve construction more intuitive and reasonable. An example for 
how the manipulation of a single control point impacts the entire curve is 
given in Figure 3.7. 


(a) 2-dimensional Bézier curve of degree 4.
(b) Impact of relocating a single (intermediate) control point.


Figure 3.7: Illustration of global control, i.e. every control point affects every curve point, in
Bézier curves. The initial curve is depicted in red and the modified curve in green.


In addition, Bézier curves provide a compact representation of the entire curve 
in terms of a set of control points. This, in turn, allows the description of a 
whole sequence of random variables, requiring only few stochastic control 
points. Further, Bézier curves can be scaled up to higher dimensions easily 
by increasing the dimensionality of the control points. Besides that, Bézier 
curves provide a commonly used building block for splines, which are
segmented curves consisting of multiple parametric curves. The ability to
combine Bézier curves into splines is relevant for a recurrent extension of the
N-Curve model as discussed in Section 3.2. Figure 3.8 provides basic examples
for scalability and Bézier splines.


(a) Exemplary 3-dimensional Bézier curve of degree 4.
(b) Exemplary Bézier spline with 3 Bézier curve segments.


Figure 3.8: Basic examples for Bézier curve scalability (3.8a) and Bézier curves as a building block
for Bézier splines consisting of Bézier curve segments (3.8b). Scalability is illustrated
by increasing the control point dimension to 3, resulting in a 3-dimensional Bézier
curve.


In the context of regression-based deep learning approaches to probabilistic 
sequence modeling, a model based on Bézier curves is expected to have a pos- 
itive impact on the training and inference process. Due to the modeled mean 
sequence being constrained by an underlying parametric curve and the
omission of an iterative generation approach, the generation of outliers within the
sequence can be avoided. This, in turn, reduces the effect of error propagation 
present in iterative approaches under the presence of outliers. 


3.1.2 A potential caveat: Non-linear Covariance Blending


When combining control points into curve points, the control points are
weighted using the Bernstein basis polynomials (see Equations (3.4) and
(3.5)). While the control point mean vectors are linearly interpolated when
calculating a curve point mean vector, non-linear weighting is introduced
for the covariance matrices in Equation (3.5), due to the control points being
Gaussian random variables. This, in turn, leads to an effect that prevents
the N-Curve model from maintaining a constant variance along the curve.
Instead, it is scaled down for curve points with $0 < t < 1$, resulting in a
squeezing effect.


This effect is easiest to see by taking an N-Curve with 2 control points,
resembling a straight line, as an example. Setting the variance of both control
points to 1, the variance of intermediate points follows the parabola
$(1 - t)^2 + t^2$, which attains its minimum of 0.5 at $t = 0.5$, because of the
non-linear covariance weighting. This is illustrated in Figure 3.9.


Figure 3.9: Illustration of non-linear variance interpolation using a simple 1-dimensional
N-Curve with 2 control points. Both control points $P_0$ and $P_1$ have a variance of 1.
The shaded region around the curve depicts 1, 2 and 3 times the variance for each
curve point. It can be seen that the evolution of the variance along the curve is
parabolic.


Recalling Equation (3.5), in $\Sigma^\psi(t) = \sum_i (b_{i,N}(t))^2 \Sigma_i$,
the Bernstein coefficients need to be squared when blending the covariance
matrices, due to


$$\mathrm{cov}[aX] = \mathbb{E}\left[(aX - \mathbb{E}[aX])(aX - \mathbb{E}[aX])^\top\right] = a^2 \cdot \mathrm{cov}[X]. \tag{3.13}$$


Following this and $b_{i,N}(t) < 1$ for $0 < t < 1$, the normalization property
does not hold for the squared weights, i.e. $\sum_i (b_{i,N}(t))^2 < 1$. Obvious
attempts to mitigate this effect involve the addition of intermediate control
points or an adjustment of intermediate control point variances. First, adding
intermediate control points with constant variance only amplifies the squeezing
effect, due to the increasing number of weights being involved, leading to
$\sum_i (b_{i,M}(t))^2 < \sum_i (b_{i,N}(t))^2$ for $M > N$. Second, adjusting
intermediate control point variances can only mitigate the squeezing effect for
selected curve points, making it at least viable for discrete-time stochastic
processes in theory. The effect of both approaches on the variance along the
curve is depicted in Figure 3.10.


(a) Exemplary 1-dimensional N-Curve using 3 intermediate control points with variance 1.
(b) Exemplary 1-dimensional N-Curve using one intermediate control point with increased variance.


Figure 3.10: Illustration of different approaches trying to mitigate the squeezing effect in
N-Curves due to non-linear variance blending. In both subfigures, a simple
1-dimensional N-Curve is depicted. Respective first and last control points have a
variance of 1. The shaded region around the curve depicts 1, 2 and 3 times the
variance for each curve point. It can be seen that, while adding multiple intermediate
control points with the same variance amplifies the squeezing effect, increasing
intermediate control point variances can help mitigate the effect.
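

The squeezing effect and both mitigation attempts can be checked numerically
for 1-dimensional N-Curves. The sketch below blends scalar control point
variances with squared Bernstein weights as in Equation (3.5); the intermediate
variance of 3.5 is an illustrative value chosen so that the curve point
variance at $t = 0.5$ equals 1.

```python
from math import comb

def bernstein(i, n, t):
    return comb(n, i) * (1.0 - t) ** (n - i) * t ** i

def blended_variance(variances, t):
    """Curve point variance from squared Bernstein weights (Equation (3.5))."""
    n = len(variances) - 1
    return sum(bernstein(i, n, t) ** 2 * v for i, v in enumerate(variances))

ts = [k / 10 for k in range(11)]

# Two control points with variance 1: parabola (1 - t)^2 + t^2.
two_points = [blended_variance([1.0, 1.0], t) for t in ts]

# Three additional intermediate control points with variance 1.
five_points = [blended_variance([1.0] * 5, t) for t in ts]

# One intermediate control point with increased variance.
raised_at_half = blended_variance([1.0, 3.5, 1.0], 0.5)
raised_at_quarter = blended_variance([1.0, 3.5, 1.0], 0.25)
```

The variance of the two-point curve drops to 0.5 at $t = 0.5$, the five-point
curve is squeezed even further, and raising the intermediate variance restores
a variance of 1 only at the selected index, not along the entire curve.
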


Although, in theory, this effect seems like a major drawback of the N-Curve
model, especially in the continuous-time case, it is less relevant in practice,
due to the prevalence of discrete sequence data. In addition, real-world data
is commonly subject to noise, which makes the constant variance case
discussed in this subsection less likely to appear. In order to provide more
insight into this effect and its impact in the context of sequence modeling, it is
discussed further in the context of experiments conducted on real-world data
in Section 5.1.5.4.


3.1.3 The N-Curve Model as a Generative Model 


For modeling a stochastic process, the N-Curve model provides a Gaussian 
probability distribution for each point in time. At the same time, it provides a 
probability distribution over parametric curves. Following this, the N-Curve 
model can be used as a generative model to either generate samples at specific 
points in time or to generate (continuous) realizations of the stochastic process 
itself. The latter can be achieved by sampling a set of Bézier curve control 
points from an N-Curve $\psi$, or an N-Curve mixture $\Psi$, respectively. In the case
of N-Curve mixtures, a specific N-Curve to draw a sample from is randomly
selected according to the weight distribution $\pi$ in a first step. A set of samples
drawn from an N-Curve and a mixture of N-Curves is depicted in Figure 3.11. 
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(a) Samples for X; along an N-Curve for (b) Samples for X; along an N-Curve 
t € {0, 0.2, 0.4, 0.6, 0.8, 1}. mixture for t € {0, 0.2, 0.4, 0.6, 0.8, 1}. 


(c) Bézier curves sampled from an N-Curve. (d) Bézier curves sampled from an N-Curve 
mixture. 


Figure 3.11: Illustration of the N-Curve model as a generative model for generating data for 
specific points in time along the curve ((a) and (b)) and for generating full Bézier 
curves according to the N-Curve (mixture) control points ((c) and (d)). 
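
The two-step sampling procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the function names `bernstein` and `sample_curves` as well as the array layout (one mean vector and one covariance matrix per control point and mixture component) are hypothetical choices made for this example.

```python
import numpy as np
from math import comb


def bernstein(n, t):
    """Bernstein basis b_{i,n}(t), i = 0..n, for an array of times t -> shape (T, n + 1)."""
    t = np.asarray(t, dtype=float)[:, None]
    i = np.arange(n + 1)[None, :]
    coeff = np.array([comb(n, k) for k in range(n + 1)], dtype=float)[None, :]
    return coeff * t**i * (1.0 - t) ** (n - i)


def sample_curves(weights, means, covs, t, num_samples, rng):
    """Draw Bezier curves from an N-Curve mixture (assumed array layout).

    weights: (K,) mixture weights, means: (K, n + 1, d), covs: (K, n + 1, d, d).
    Returns an array of shape (num_samples, T, d) holding the sampled curve points.
    """
    num_components, num_ctrl, dim = means.shape
    basis = bernstein(num_ctrl - 1, t)
    curves = np.empty((num_samples, len(t), dim))
    for s in range(num_samples):
        # Step 1: select a mixture component according to the weight distribution.
        k = rng.choice(num_components, p=weights)
        # Step 2: sample every (independent) Gaussian control point of that component.
        ctrl = np.stack(
            [rng.multivariate_normal(means[k, i], covs[k, i]) for i in range(num_ctrl)]
        )
        # A sampled control point set defines an ordinary Bezier curve.
        curves[s] = basis @ ctrl
    return curves
```

Sampling values at specific points in time only (as in subfigures (a) and (b)) would instead draw from the Gaussian curve point distribution at each requested $t$.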


3.1.4 Connection to Gaussian Processes 


Generally speaking, Gaussian processes are a form of stochastic processes,
where the joint distribution of all stochastic variables $\{X_t\}_{t \in T}$ is a
multivariate Gaussian distribution. The joint distribution is obtained using an
explicit mean function and covariance function, the latter commonly referred to as
the kernel of the Gaussian process (see also Section 2.2.1). Due to the joint
distribution being Gaussian, each individual stochastic variable, either obtained
through marginalization or conditioning, is again Gaussian. Following this, a
fundamental similarity between the N-Curve model and Gaussian processes can be
observed, in that the N-Curve model provides a model for a stochastic process
$\{X_t\}_{t \in T}$ comprised of Gaussian random variables $X_t \sim \mathcal{N}(\mu_t, \Sigma_t)$. Thus,
the question arises whether the underlying N-Curves are a special case of Gaussian
processes using an implicit covariance function.


Following the definition of Gaussian processes [Mac03, Ras06], an N-Curve
is classified as a Gaussian process if, for any finite subset $\{t_1, \dots, t_k\}$
of $T$, the joint probability density $p(X_{t_1}, \dots, X_{t_k})$ of the corresponding
random variables is Gaussian. This property is referred to as the GP property in the
following and can be shown to hold true for N-Curves, as these are, in fact,
an alternative formulation for Gaussian processes with specific mean and
covariance functions.


In order to prove that the GP property holds for N-Curves, first recall that an
N-Curve $\psi$ is defined in terms of a set $\mathcal{P}_\psi$ of $N_{deg} + 1$ independent
d-dimensional Gaussian control points $P_i \sim \mathcal{N}(\mu_i, \Sigma_i)$, which are
defined as column vectors, i.e.

$$ P_i = \begin{pmatrix} P_{i,1} \\ \vdots \\ P_{i,d} \end{pmatrix}. \tag{3.14} $$

Using these control points, a sequence of Gaussian probability distributions
along the corresponding N-Curve has been defined (see Equation (3.3)). As
an alternative to this approach, the control points can also be stacked into the
$((N_{deg} + 1) \cdot d \times 1)$ control point random vector

$$ P = \begin{pmatrix} P_0 \\ \vdots \\ P_{N_{deg}} \end{pmatrix}, \tag{3.15} $$

consisting of independent Gaussian random variables, which is itself jointly
Gaussian. Further, an $(N \cdot d \times (N_{deg} + 1) \cdot d)$ transformation matrix

$$ C = \begin{pmatrix} B_{0,N_{deg}}(t_1) & \cdots & B_{N_{deg},N_{deg}}(t_1) \\ \vdots & \ddots & \vdots \\ B_{0,N_{deg}}(t_N) & \cdots & B_{N_{deg},N_{deg}}(t_N) \end{pmatrix} \tag{3.16} $$

with

$$ B_{i,N_{deg}}(t) = b_{i,N_{deg}}(t) \cdot I_d, \tag{3.17} $$

where $I_d$ is the d-dimensional identity matrix, can be derived using the
Bernstein polynomials $b_{i,N_{deg}}(t_j)$ with $t_j \in T_N$ (see Equation (3.7) for $T_N$), in order
to map the control point random vector $P$ onto a random vector consisting of
$N$ d-dimensional Gaussian curve points, i.e.

$$ X = C \cdot P = \begin{pmatrix} B_{0,N_{deg}}(t_1) \cdot P_0 + \dots + B_{N_{deg},N_{deg}}(t_1) \cdot P_{N_{deg}} \\ \vdots \\ B_{0,N_{deg}}(t_N) \cdot P_0 + \dots + B_{N_{deg},N_{deg}}(t_N) \cdot P_{N_{deg}} \end{pmatrix} = \begin{pmatrix} X_1 \\ \vdots \\ X_N \end{pmatrix}. \tag{3.18} $$

As $X$ is obtained through a linear transformation of a Gaussian random vector,
it is jointly Gaussian as well. As a consequence, the corresponding probability
density function $p(X) = p(X_1, \dots, X_N)$ is a Gaussian probability density, thus
the GP property holds.
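
The construction of the transformation matrix $C$ and the resulting joint Gaussian of the curve points can be sketched numerically. The following is a minimal example assuming NumPy; the function names `bernstein` and `curve_point_joint` are hypothetical and only serve this illustration. It uses the fact that a linear map $X = C \cdot P$ of a Gaussian vector with mean $\mu_P$ and covariance $\Sigma_P$ has mean $C \mu_P$ and covariance $C \Sigma_P C^\top$.

```python
import numpy as np
from math import comb


def bernstein(n, t):
    """Bernstein basis b_{i,n}(t), i = 0..n, for an array of times t -> shape (N, n + 1)."""
    t = np.asarray(t, dtype=float)[:, None]
    i = np.arange(n + 1)[None, :]
    coeff = np.array([comb(n, k) for k in range(n + 1)], dtype=float)[None, :]
    return coeff * t**i * (1.0 - t) ** (n - i)


def curve_point_joint(means, covs, t):
    """Joint Gaussian of N curve points of a single N-Curve.

    means: (n + 1, d) control point means, covs: (n + 1, d, d) control point covariances.
    Returns C, the stacked mean of X and the full covariance of X.
    """
    num_ctrl, dim = means.shape
    basis = bernstein(num_ctrl - 1, t)          # (N, n + 1)
    # Block (j, i) of C equals b_{i,n}(t_j) * I_d, cf. Equations (3.16) and (3.17).
    C = np.kron(basis, np.eye(dim))
    mu_P = means.reshape(-1)                    # stacked control point means
    Sigma_P = np.zeros((num_ctrl * dim, num_ctrl * dim))
    for i in range(num_ctrl):                   # independent points: block-diagonal
        Sigma_P[i * dim:(i + 1) * dim, i * dim:(i + 1) * dim] = covs[i]
    return C, C @ mu_P, C @ Sigma_P @ C.T
```

The diagonal blocks of the returned covariance reproduce the per-point covariances of the N-Curve, while the off-diagonal blocks hold the correlations between different curve points.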


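Next, the mean function and Gaussian process kernel induced by a given
N-Curve will be derived. For simplicity, only the 1-dimensional case is
regarded, which is also the common use case of Gaussian processes. Following
this, the control points $P_i$ are defined by the mean value $\mu_i$ and variance $\sigma_i^2$.
While the mean function is equal to that of the N-Curve itself (see Equation
(3.4)), the kernel $k_{\mathcal{P}_\psi}(t, t')$ for two curve points
$X = \sum_{i=0}^{N_{deg}} b_{i,N_{deg}}(t) P_i$ and $Y = \sum_{i=0}^{N_{deg}} b_{i,N_{deg}}(t') P_i$
at indices $t$ and $t'$ with $t, t' \in [0,1]$, and respective mean values
$\mu_X = \sum_{i=0}^{N_{deg}} b_{i,N_{deg}}(t) \mu_i$ and $\mu_Y = \sum_{i=0}^{N_{deg}} b_{i,N_{deg}}(t') \mu_i$,
is defined as follows (with $t_i = t$, $t_j = t'$ and $n = N_{deg}$):

$$
\begin{aligned}
k_{\mathcal{P}_\psi}(t_i, t_j) &= \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] \\
&= \mathbb{E}\left[\left(\sum_{k=0}^{n} b_{k,n}(t_i) P_k - \mu_X\right)\left(\sum_{k=0}^{n} b_{k,n}(t_j) P_k - \mu_Y\right)\right] \\
&= \mu_X \mu_Y + \mathbb{E}\left[\sum_{k=0}^{n} \left(\sum_{k'=0}^{n} b_{k,n}(t_i) b_{k',n}(t_j) P_k P_{k'}\right)\right] \\
&\quad - \mathbb{E}\left[\sum_{k=0}^{n} b_{k,n}(t_i) \mu_Y P_k\right] - \mathbb{E}\left[\sum_{k=0}^{n} b_{k,n}(t_j) \mu_X P_k\right] \\
&= \mu_X \mu_Y + \mathbb{E}\left[\sum_{k=0}^{n} b_{k,n}(t_i) b_{k,n}(t_j) P_k^2\right] \\
&\quad + \mathbb{E}\left[\sum_{k=0}^{n} \left(\sum_{k'=0, k' \neq k}^{n} b_{k,n}(t_i) b_{k',n}(t_j) P_k P_{k'}\right)\right] \\
&\quad - \mu_Y \sum_{k=0}^{n} b_{k,n}(t_i) \mu_k - \mu_X \sum_{k=0}^{n} b_{k,n}(t_j) \mu_k.
\end{aligned}
$$

By applying $\mathbb{E}[P_i \cdot P_j] = \mathbb{E}[P_i] \cdot \mathbb{E}[P_j]$ for $i \neq j$, which follows from
the independence of the control points, and $\mathbb{E}[P_i^2] = \mathrm{Var}[P_i] + (\mathbb{E}[P_i])^2$,
the closed-form solution

$$
\begin{aligned}
k_{\mathcal{P}_\psi}(t_i, t_j) &= \mu_X \mu_Y + \sum_{k=0}^{n} b_{k,n}(t_i) b_{k,n}(t_j) (\sigma_k^2 + \mu_k^2) \\
&\quad + \sum_{k=0}^{n} \left(\sum_{k'=0, k' \neq k}^{n} b_{k,n}(t_i) b_{k',n}(t_j) \mu_k \mu_{k'}\right) \\
&\quad - \mu_Y \sum_{k=0}^{n} b_{k,n}(t_i) \mu_k - \mu_X \sum_{k=0}^{n} b_{k,n}(t_j) \mu_k
\end{aligned}
\tag{3.19}
$$

follows. It can be noted that the diagonal elements of a covariance matrix obtained
by $k_{\mathcal{P}_\psi}(t, t')$ correspond to the interpolated covariances of the given N-Curve
as defined in Equation (3.5).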
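
As a numerical cross-check of the closed-form kernel, the sketch below evaluates Equation (3.19) term by term and compares it with a shorter expression. The shorter form is not stated in the text above; it follows because all terms involving the control point means cancel, leaving $\sum_k b_{k,n}(t_i) b_{k,n}(t_j) \sigma_k^2$. The function names are hypothetical and NumPy is assumed.

```python
import numpy as np
from math import comb


def bernstein_vec(n, t):
    """Vector (b_{0,n}(t), ..., b_{n,n}(t)) for a scalar t in [0, 1]."""
    return np.array([comb(n, k) * t**k * (1.0 - t) ** (n - k) for k in range(n + 1)])


def kernel_long(t_i, t_j, mu, sigma):
    """Term-by-term evaluation of Equation (3.19) for 1-dimensional control points."""
    n = len(mu) - 1
    bi, bj = bernstein_vec(n, t_i), bernstein_vec(n, t_j)
    mu_x, mu_y = bi @ mu, bj @ mu
    same = np.sum(bi * bj * (sigma**2 + mu**2))        # k' = k terms
    outer = np.outer(bi * mu, bj * mu)
    cross = outer.sum() - np.trace(outer)              # k' != k terms
    return mu_x * mu_y + same + cross - mu_y * (bi @ mu) - mu_x * (bj @ mu)


def kernel_short(t_i, t_j, sigma):
    """Simplified form: the mean terms cancel, only the variances remain."""
    n = len(sigma) - 1
    return np.sum(bernstein_vec(n, t_i) * bernstein_vec(n, t_j) * sigma**2)
```

For $t_i = t_j$ the simplified form reduces to $\sum_k b_{k,n}(t_i)^2 \sigma_k^2$, which is the interpolated variance referred to above.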
Now, with the N-Curve model covering a specific subset of Gaussian processes,
the main commonalities and differences between both formulations
will be discussed briefly. Although both Gaussian processes and the N-Curve
model target a distribution over functions, or rather parametric curves in the
case of N-Curves, there is a key difference worth mentioning, in that both
approaches provide a different perspective on the task of distribution modeling.
While Gaussian processes pursue a bottom-up approach, especially
in Gaussian process regression [Ras06], N-Curves provide a top-down approach.
As such, in Gaussian process regression, the relation between Gaussian
“curve points” is modeled explicitly using the covariance function. Treating
these curve points as part of a partitioned joint distribution then ensures
the GP property. In the N-Curve model, the distribution over functions is
achieved by modeling the curve-defining control points stochastically, which
dictate the relation between curve points implicitly. Thereby, the GP property
follows from the correlation between curve points, which emerges from geometric
constraints given by the underlying Bézier curve, i.e. the curve points
being linear transformations of the same set of stochastic control points.


In order to conclude this short section on the connection between N-Curves
and Gaussian processes, a few illustrations are given, which compare commonly
used Gaussian process kernels with different N-Curve kernels. After
that, a simple toy example, depicting the calculation of the posterior
distribution given a few observations of a target function, is provided. For these
examples, only zero-mean Gaussian processes are considered.


Figure 3.12 illustrates a radial basis function (abbrev.: RBF, [Gör19]) kernel

$$ k^{rbf}_{\sigma,l}(t, t') = \sigma^2 \exp\left(-\frac{\|t - t'\|^2}{2 l^2}\right), \tag{3.20} $$

with $\sigma = 1$ and $l = 0.25$, a linear kernel [Gör19]

$$ k^{lin}_{\sigma,\sigma_b,c}(t, t') = \sigma_b^2 + \sigma^2 (t - c)(t' - c), \tag{3.21} $$

with $\sigma = \sigma_b = c = 0.5$, and two N-Curve kernels $k_{\mathcal{P}_1}(t, t')$ and $k_{\mathcal{P}_2}(t, t')$.
$\mathcal{P}_1$ consists of two unit Gaussians, i.e. $\mathcal{N}(0, 1)$, and $\mathcal{P}_2$ consists of 9 zero-mean
Gaussian control points with standard deviations $\sigma_0 = \sigma_8 = 1$, $\sigma_1 = \sigma_7 = 1.25$,
$\sigma_2 = \sigma_6 = 1.5$, $\sigma_3 = \sigma_5 = 1.75$ and $\sigma_4 = 2$. The standard deviation
increases towards the center of the control point set, in order to cope with
non-linear blending (see also Section 3.1.2). For each kernel, the covariance
matrix has been calculated for 20 equally spaced values ranging from 0 to 1.


(a) RBF kernel $k^{rbf}_{\sigma=1,l=0.25}(t, t')$.
(b) Linear kernel $k^{lin}_{\sigma=0.5,\sigma_b=0.5,c=0.5}(t, t')$.
(c) N-Curve kernel $k_{\mathcal{P}_1}(t, t')$.
(d) N-Curve kernel $k_{\mathcal{P}_2}(t, t')$.


Figure 3.12: Covariance matrices for 20 equally spaced values in [0,1] obtained by using differ- 
ent Gaussian process kernels. 


When comparing the covariance matrices in subfigures (b) and (c), it can be seen
that the results from the kernel based on a linear N-Curve with unit Gaussian
control points look similar to those based on the given linear kernel. In fact,
the covariance matrix calculated with $k_{\mathcal{P}_1}$ is equal to the covariance matrix
calculated with $k^{lin}_{\sigma=0.5,\sigma_b=0.5,c=0.5}(t, t')$ when the values of both are normalized to $[0, 1]$.
On the other hand, the covariance matrix obtained with $k_{\mathcal{P}_2}$ (subfigure (d)), which
is derived from a more complex N-Curve, tends to be more comparable to the
covariance matrix calculated with $k^{rbf}_{\sigma=1,l=0.25}(t, t')$ (subfigure (a)).
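
The stated equality between the normalized covariance matrices of the linear kernel and the degree-1 N-Curve kernel can be checked with a few lines of NumPy. The sketch below assumes that "normalizing to $[0, 1]$" means min-max normalization of the matrix entries, which is an assumption of this example; the function names are hypothetical.

```python
import numpy as np


def linear_kernel(t, s, sigma=0.5, sigma_b=0.5, c=0.5):
    """Linear kernel of Equation (3.21)."""
    return sigma_b**2 + sigma**2 * (t - c) * (s - c)


def ncurve_kernel_degree_one(t, s):
    """Kernel of an N-Curve with two zero-mean, unit-variance control points.

    With b_0(t) = 1 - t and b_1(t) = t this is b_0(t) b_0(s) + b_1(t) b_1(s).
    """
    return (1.0 - t) * (1.0 - s) + t * s


def min_max(K):
    """Rescale all matrix entries linearly to the range [0, 1]."""
    return (K - K.min()) / (K.max() - K.min())


t = np.linspace(0.0, 1.0, 20)
T, S = np.meshgrid(t, t, indexing="ij")
K_lin = linear_kernel(T, S)
K_ncurve = ncurve_kernel_degree_one(T, S)
same_after_normalization = np.allclose(min_max(K_lin), min_max(K_ncurve))
```

The reason is that both kernels are increasing affine functions of $(t - 0.5)(t' - 0.5)$, so they differ only by an offset and a positive scale, which min-max normalization removes.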


In combination with a mean vector, 0 in this case, each covariance matrix 
defines a prior distribution for a Gaussian process. Following this, Figure 3.13 
depicts sample functions drawn from each prior distribution, again showing 
the parallels between the kernels. 


[Four panels showing sample functions plotted over x in [0, 1].]

(a) RBF kernel $k^{rbf}_{\sigma=1,l=0.25}(t, t')$.
(b) Linear kernel $k^{lin}_{\sigma=0.5,\sigma_b=0.5,c=0.5}(t, t')$.
(c) N-Curve kernel $k_{\mathcal{P}_1}(t, t')$.
(d) N-Curve kernel $k_{\mathcal{P}_2}(t, t')$.


Figure 3.13: Samples drawn from prior distributions using different Gaussian process kernels.
The $2\sigma$ region around the mean value is depicted as a red shaded area.
Finally, the Gaussian processes defined by the RBF kernel $k^{rbf}_{\sigma=1,l=0.25}$ and
the N-Curve kernel $k_{\mathcal{P}_2}$ are used to approximate f(x) = sin(8x) on [0,1]
using 4 observed data points. Using these data points, the posterior
distribution of each Gaussian process can be calculated, which ideally tends to fit
the targeted function with an increasing number of observed data points. The
posterior distributions for both Gaussian processes are depicted in Figure 3.14.

(a) RBF kernel $k^{rbf}_{\sigma=1,l=0.25}(t, t')$.
(b) N-Curve kernel $k_{\mathcal{P}_2}(t, t')$.


Figure 3.14: Posterior distributions of Gaussian processes obtained by using different kernels
given 4 data points (circular markers) of a sine function (dashed line). The $2\sigma$ region
around the mean value is depicted as a red shaded area.
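
The posterior computation for the N-Curve kernel can be sketched with the standard noise-free Gaussian process conditioning formulas. This is an illustrative sketch, not the code used to produce Figure 3.14: the observation locations in `x_obs`, the small `jitter` term added for numerical stability, and all function names are assumptions of this example. The kernel uses the simplified zero-mean form $\sum_k b_{k,n}(t) b_{k,n}(t') \sigma_k^2$.

```python
import numpy as np
from math import comb


def bernstein(n, t):
    """Bernstein basis b_{i,n}(t), i = 0..n, for an array of times t -> shape (T, n + 1)."""
    t = np.asarray(t, dtype=float)[:, None]
    i = np.arange(n + 1)[None, :]
    coeff = np.array([comb(n, k) for k in range(n + 1)], dtype=float)[None, :]
    return coeff * t**i * (1.0 - t) ** (n - i)


def ncurve_kernel(t, s, sigma):
    """Covariance matrix between curve points at times t and s (zero-mean control points)."""
    n = len(sigma) - 1
    return bernstein(n, t) @ np.diag(sigma**2) @ bernstein(n, s).T


def gp_posterior(kernel, x_obs, y_obs, x_query, jitter=1e-9):
    """Zero-mean GP posterior mean and covariance given noise-free observations."""
    K = kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    K_s = kernel(x_query, x_obs)
    mean = K_s @ np.linalg.solve(K, y_obs)
    cov = kernel(x_query, x_query) - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov


# Standard deviations of the 9 control points described in the text.
sigma = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 1.75, 1.5, 1.25, 1.0])
kernel = lambda a, b: ncurve_kernel(a, b, sigma)

x_obs = np.array([0.1, 0.35, 0.6, 0.9])   # hypothetical observation locations
y_obs = np.sin(8.0 * x_obs)
x_query = np.linspace(0.0, 1.0, 50)
mean, cov = gp_posterior(kernel, x_obs, y_obs, x_query)
```

Note that the N-Curve kernel has finite rank (at most the number of control points), so the number of noise-free observations that can be interpolated exactly is bounded by the number of control points.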


3.2 Modeling Infinite Stochastic Processes 


The N-Curve model as presented in Section 3.1 is viable for
most real-world applications, which are generally concerned with sequences
of fixed or at least bounded length. Thus, using Bézier curves as a basis, the
representable curve complexity is sufficient for the requirements of the given
sequential data. However, apart from the practicality of the N-Curve model, there
exists a conceptual limitation for modeling continuous-time stochastic
processes $\{X_t\}_{t \in T}$. This limitation is given by the bounded index set $T = [0, 1]$,
which is imposed by the use of a Bézier curve basis. Because of this, infinite
continuous-time stochastic processes, i.e. with $T = \mathbb{R}_0^+$, cannot be
represented by the N-Curve model. Further, infinite discrete-time stochastic
processes $\{X_t\}_{t \in T_N}$, which model open-ended sequences and are realized using
a discrete subset $T_N$ of $T$, are also affected by this limitation in a more
subtle way. Although the number of control points of an N-Curve is fixed, it is
still possible to extract an infinite number of curve points with infinitesimal
distance between subsequent curve points. However, as a sequence becomes
longer, it generally also expands in space. Thus, a potentially more complex
underlying Bézier curve, i.e. a curve of higher degree, is required for achieving
an accurate approximation. While there is no theoretical limit to the number
of control points defining a Bézier curve, the approximation quality may suffer
from an increasing number of control points in practice. This is due to the
increased number of competing control points, each contributing to every
curve point (global control).


In the context of parametric curves, a common approach to tackle increasing
curve complexity in terms of length and shape is the use of segmented
curves. Here, simpler curves of fixed degree are stitched together in order to
form a more complex curve, granting local control over curve segments. Thus,
the number of segments can be increased as required, without affecting the
entire curve. In the context of Bézier curves, such a curve is then called a
composite Bézier curve or Bézier spline [Reb21]. Following this, a Bézier spline
of degree $N_{deg}$ is defined in terms of a sequence of $N$ Bézier curve segments
$\mathcal{P} = \{\mathcal{P}_1, \dots, \mathcal{P}_N\}$, where each segment is defined by its own set of control
points $\mathcal{P}_i = \{P_0^i, \dots, P_{N_{deg}}^i\}$. Further, at least $C^0$ continuity, i.e.
$P_{N_{deg}}^i = P_0^{i+1}$, holds for subsequent Bézier curve segments.


If necessary, additional smoothness requirements, e.g. $C^1$ or $C^2$ continuity,
can be added. Under $C^1$ continuity, subsequent curve segments have identical
tangents at the control point joining both segments. $C^2$ continuous curves
additionally have identical curvature at this point [Bar95]. Under $C^0$
continuity, Bézier splines grant local control, i.e. the curve can be altered on a
per-segment basis without affecting other segments. This flexibility is
restricted when enforcing $C^1$ or $C^2$ continuity, as neighboring control points
of subsequent curve segments become dependent on one another. Geometrically,
$C^1$ continuity can be enforced by making the second last control point
$P_{N_{deg}-1}^i$ of a curve segment and the second control point $P_1^{i+1}$ of the
subsequent curve segment collinear. For $C^2$ continuity, these control points
additionally have to have the same (Euclidean) distance from the joining control
point $P_{N_{deg}}^i = P_0^{i+1}$. Following this, local control can be granted to some
degree by using more than 4 control points in a segment. Examples for $C^0$, $C^1$
and $C^2$ continuous segment intersections are given in Figure 3.15.


Figure 3.15: Examples for Bézier splines consisting of three segments with varying continuity
constraints at segment intersections. Subfigure (a) depicts a $C^0$ continuous Bézier
spline and (b) depicts a Bézier spline meeting $C^2$ continuity at the intersection of
the first two segments and $C^1$ continuity at the intersection of the second and third
segment.


3.2.1 The Meta-time N-Curve Model 


Given the aforementioned conceptual limitations, the goal is to extend the
N-Curve model for infinite stochastic processes and open-ended sequences,
i.e. a stochastic process with index* $t \in \mathbb{R}_0^+$. For this purpose, the concept
of splines is incorporated into the model, thus combining N-Curve segments
into a more complex probabilistic spline. In order to model an indefinite number
of curve segments, a notion of control point evolution is introduced by
defining the set of Gaussian control points as a function $\mathcal{P}_\psi(t)$ of time $t$. Due
to each function value then defining an entire curve segment covering multiple
time steps, a (potentially) asynchronous timeline emerges. This timeline
is denoted as meta-time with index $\hat{t}$ in the following. Subsequent values of
$\hat{t}$ then yield subsequent N-Curve segments on the probabilistic spline
modeling the stochastic process. This probabilistic spline will be denoted as a
meta-time N-Curve in the following. With the control point function
representing a sequence of Gaussian control point sets, the meta-time N-Curve is
defined as a sequence of connected N-Curve segments $\psi = \{\psi_{\hat{t}}\}$,
indexed by meta-time $\hat{t} \in \mathbb{N}_0$. For associating a point in time $t$ with the
corresponding N-Curve segment, the meta-time mapping $m : t \mapsto \hat{t}$ onto the
meta-timeline is introduced. In addition, with $t$ (and $\hat{t}$) now exceeding the
index range of an N-Curve, another mapping $m_c : t \mapsto t_c$ onto the curve-time
parameter $t_c \in [0, 1]$ is introduced in order to access the exact curve point on
a curve segment. This mapping is denoted as the curve-time mapping. Note
that $m_c$ is technically defined as $m_c : (t, \hat{t}) \mapsto t_c$, but as $\hat{t}$ is derived from $t$
through $m$, the additional parameter can be omitted. Finally, in the context
of meta-time N-Curves, $C^0$ continuity is given by matching the mean vectors
and covariance matrices at the intersection of subsequent N-Curve segments.
The aforementioned geometric restrictions for $C^1$ and $C^2$ continuity only apply
to control point mean values. Figure 3.16 gives an illustration of the different
timelines and the basic idea of the model extension.

* The index t will again be interpreted as time for a more intuitive derivation of this extension.

Figure 3.16: Illustration of the proposed extension of the N-Curve model, basing stochastic
process modeling on probabilistic Bézier splines instead of single Bézier curves, thus
making it possible to model infinite stochastic processes. Interpreting the stochastic
process index $t$ as time, multiple connected timelines emerge, namely meta-time $\hat{t}$ and
curve-time $t_c$. Given a specific point in time $t$, the corresponding N-Curve segment
is determined by the mapping $m(t)$ and the specific point on the segment
by $m_c(t)$. An exemplary resulting probabilistic spline $B_{\mathcal{N}}(t, \psi)$ is depicted at the
bottom.


With the introduced timeline mappings, the original formulation of the N- 
Curve model can be extended into the meta-time N-Curve model as depicted 
in Table 3.1. Here, only the definition of the extended N-Curve is provided, 
as the derivation of other formulas building on the curve definition, e.g. the 
curve point probability density function (Equation (3.6)), is not directly af- 
fected by these changes. Further, exemplary definitions are provided for both 
the meta-time and the curve-time mapping. Potential definitions for these 
mappings are discussed in Section 3.2.2. 
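Table 3.1: Overview of changes made to the N-Curve model in order to derive the meta-time
N-Curve model extension. For completeness, examples for the meta-time and
curve-time mappings are provided.

Targeted stochastic process
    N-Curve:            $\{X_t\}_{t \in [0,1]}$
    Meta-time N-Curve:  $\{X_t\}_{t \in \mathbb{R}_0^+}$
Curve (segment) definition
    N-Curve:            $\psi = \{P_0, \dots, P_{N_{deg}}\}$
    Meta-time N-Curve:  $\psi = \{\{P_0^{\hat{t}}, \dots, P_{N_{deg}}^{\hat{t}}\}\}_{\hat{t} \in \mathbb{N}_0}$, $\psi_{\hat{t}} = \{P_0^{\hat{t}}, \dots, P_{N_{deg}}^{\hat{t}}\}$
Curve points (see Equation (3.3))
    N-Curve:            $B_{\mathcal{N}}(t, \psi) = (\mu^{\psi}(t), \Sigma^{\psi}(t))$
    Meta-time N-Curve:  $B_{\mathcal{N}}(t, \psi) = (\mu^{\psi_{m(t)}}(m_c(t)), \Sigma^{\psi_{m(t)}}(m_c(t)))$
Mean function (see Equation (3.4))
    N-Curve:            $\mu^{\psi}(t) = \sum_{i=0}^{N_{deg}} b_{i,N_{deg}}(t) \mu_i$
    Meta-time N-Curve:  $\mu^{\psi_{\hat{t}}}(t_c) = \sum_{i=0}^{N_{deg}} b_{i,N_{deg}}(t_c) \mu_i^{\hat{t}}$, with $\mu_i^{\hat{t}} \in \psi_{\hat{t}}$
Covariance function (see Equation (3.5))
    N-Curve:            $\Sigma^{\psi}(t) = \sum_{i=0}^{N_{deg}} (b_{i,N_{deg}}(t))^2 \Sigma_i$
    Meta-time N-Curve:  $\Sigma^{\psi_{\hat{t}}}(t_c) = \sum_{i=0}^{N_{deg}} (b_{i,N_{deg}}(t_c))^2 \Sigma_i^{\hat{t}}$, with $\Sigma_i^{\hat{t}} \in \psi_{\hat{t}}$
Meta-time mapping
    N-Curve:            -
    Meta-time N-Curve:  $\hat{t} = m(t) = \lfloor t \rfloor$
Curve-time mapping
    N-Curve:            -
    Meta-time N-Curve:  $t_c = m_c(t) = t - m(t)$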


On a final note, the meta-time N-Curve model can be used in the context of
multi-modal stochastic processes by following the same approach as described
for the original N-Curve model using a mixture of meta-time N-Curves. In
this case, each mode of a stochastic process representation follows a separate
meta-time N-Curve. The mixture weight distribution is defined on a per-segment
basis, i.e. $\pi(\hat{t}) = \{\pi_1, \dots, \pi_K\}$ is given at meta-time $\hat{t}$. Following this, a
potential benefit of meta-time N-Curves is given by the fact that it is possible
to alter the curve weights in each meta-time step. This allows for more control
over the number of required mixture components on a per-segment basis.
3.2.2 Mapping Functions


In the context of the meta-time N-Curve model, two mappings have been defined,
namely the meta-time mapping and the curve-time mapping, connecting
the introduced timelines. As there possibly exists a wide range of options
for defining either mapping, this section provides exemplary mapping
definitions, which are expected to be relevant for an actual implementation of the
model in the context of different sequence modeling tasks.


Meta-time mapping: Moving along the meta-timeline yields a sequence of
N-Curve segments along a probabilistic spline. In this context, the meta-time
mapping maps a given point in time $t$ onto the meta-timeline, i.e. a natural
number $\hat{t} \in \mathbb{N}_0$, in order to determine the corresponding N-Curve segment.
Following this, the goal is to define a consistent meta-time mapping from $t$
onto the set of natural numbers.


The first possible definition of this mapping is given by a fixed interval
mapping

$$ m_a(t) = \left\lfloor \frac{t}{a} \right\rfloor. \tag{3.22} $$

In this case, the meta-time $\hat{t}$ is advanced at a fixed rate as $t$ increases, thus
traversing an infinite sequence of (distinct) N-Curve segments. While this
definition may result in premature segment changes, an interesting special
case is given for $a = 1$, where the resulting meta-time N-Curve resembles
a probabilistic spline with segments connected at their endpoints. Besides
resulting in a well-defined segmented probabilistic curve, this further allows
a straightforward definition of $C^1$ and $C^2$ continuous segment intersections
via control point placement.


In cases where periodic repetitions are expected in sequential data, another
definition is given by a periodic mapping

$$ m_{a,b,p}(t) = \left\lfloor \frac{2a}{\pi} \arcsin\left(\sin\left(\frac{2\pi}{p} \cdot t\right)\right) + b \right\rfloor, \tag{3.23} $$

which can be based for example on a triangle wave. Here, $t$ is mapped onto
the same sequence of N-Curve segments periodically, thus repeating the same
segment sequence over and over. The repetition frequency is controlled by $p$.
An alternative to the periodic mapping is given by a modulo reset mapping

$$ m_{a,k}(t) = \left\lfloor \frac{t}{a} \right\rfloor \bmod k \tag{3.24} $$

with $k > 1$, which repeats the same segment sequence after every $k$ meta-time
steps. This mapping is basically built on a sawtooth wave. Similar to the
fixed interval mapping, these two definitions can be parameterized to yield
an endpoint-connected probabilistic spline.


Curve-time mapping: After determining the current N-Curve segment at time
$t$, the current position within this segment needs to be determined. For this,
a curve-time mapping needs to be defined, mapping $t$ onto curve-time
$t_c \in [0, 1]$. As this mapping is highly dependent on the specific definition of the
meta-time mapping $m$, an exemplary curve-time mapping compliant with the
given variants of $m$, which result in an endpoint-connected spline, is provided.
All these variants map a given time $t$ such that a segment intersection
occurs whenever $t$ is a multiple of 1. Following this, the difference between
the value of $t$ for the first and last segment point is exactly 1 and intermediate
values are in $[0, 1]$. As such, the curve-time mapping can be defined as

$$ m_c(t) = t - m(t). \tag{3.25} $$
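
The fixed interval and modulo reset mappings, together with a curve-time mapping, can be sketched as plain functions. This is a minimal sketch with hypothetical function names. One assumption is made loudly here: the curve-time is computed as $t - \lfloor t \rfloor$, which coincides with Equation (3.25) for the fixed interval mapping with $a = 1$; for the modulo reset mapping, $t - m(t)$ would leave $[0, 1]$ once $t \geq k$, so the fractional part is used for that case as well.

```python
import math


def meta_time_fixed(t, a=1.0):
    """Fixed interval mapping: floor(t / a), cf. Equation (3.22)."""
    return math.floor(t / a)


def meta_time_modulo(t, k, a=1.0):
    """Modulo reset mapping: floor(t / a) mod k, cf. Equation (3.24)."""
    return math.floor(t / a) % k


def curve_time(t):
    """Position within the current segment for a = 1 (assumption: fractional part of t)."""
    return t - math.floor(t)


def locate(t, k=None):
    """Return (segment index, curve-time) for a point in time t, assuming a = 1."""
    segment = meta_time_fixed(t) if k is None else meta_time_modulo(t, k)
    return segment, curve_time(t)
```

With `k=None` the segment index grows without bound, whereas a finite `k` cycles through the same `k` segments.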


Learned mapping: On a final note, it is also possible to learn both mappings in
a constrained optimization setting. A potential benefit of this is an
efficient re-use of a few base segments, especially when the mapping
can be conditioned on additional domain-specific input. On the downside,
depending on the constraints defined during optimization, spline properties
regarding $C^1$ and $C^2$ continuity might be lost.
3.2.3 Modeling Discrete-time Stochastic Processes


As a last point, this section provides some insight into how the concept of
meta-time N-Curves can be applied for modeling discrete-time stochastic
processes using different mapping variants. For this, only mapping variants
resulting in an endpoint-connected probabilistic spline are considered, as these
are expected to be the most relevant in an application context.


Recall that in the N-Curve model, a discrete index set $T_N \subset T$ is extracted from
the continuous index set $T = [0, 1]$ using a predetermined sequence length $N$
(see also Section 3.1). Now, with $T = \mathbb{R}_0^+$ and sequences being unbounded
in length, it is necessary to define the sequence length $N_{seg}$ covered by each
N-Curve segment along a meta-time N-Curve. Following this, the difference
between subsequent stochastic process indices $t_i \in T_N$ and $t_{i+1} \in T_N$ is
dictated by $N_{seg}$ through the step size $\Delta t$. Using this formulation, $N = N_{seg}$ is a
necessary condition in the original N-Curve model. In contrast, in the
meta-time N-Curve model it is possible to have $N > N_{seg}$, which ultimately
allows modeling open-ended sequences by generating a stream of
N-Curve segments. The discrete index set $T_N$ can now be re-formulated as

$$ T_N = \left\{ \sum_{i=0}^{v-1} \Delta t \;\middle|\; v \in \{0, \dots, N - 1\} \right\}. \tag{3.26} $$
i=0 


The meta-time mapping $m(t_i)$ is now required to fill the additional role of
determining how many N-Curve segments are required to model a sequence
of length $N$ given $N_{seg}$. In the case of the fixed interval mapping

$$ m_{a=1}(t_i) = \lfloor t_i \rfloor, $$

the $N$-th element of a sequence lies on the $m_{a=1}(t_N)$-th N-Curve segment.
Thus, the meta-time N-Curve model needs to generate

$$ m_{a=1}(t_N) = \lfloor t_N \rfloor $$

segments compliant with the given continuity constraints. With $N \to \infty$, this
results in an indefinite stream of distinct N-Curve segments, connected at
their endpoints. Looking at periodic mappings, e.g. the modulo reset mapping

$$ m_{a=1,k}(t_i) = \lfloor t_i \rfloor \bmod k, $$

the meta-time N-Curve model needs to generate a maximum of $k$ segments,
which are referenced in a loop via $m_{a=1,k}$. The maximum number of segments
is required when $N > (k - 1) \cdot N_{seg}$. For $N \to \infty$, this results in an indefinite
repetition of the same $k$ N-Curve segments. Finally, in both cases, specific
random variables along the meta-time N-Curve are determined according to
$m_c$ and the corresponding N-Curve segment.
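
The assignment of sequence elements to segments can be illustrated with a short sketch. It rests on an assumption that is not taken from the text: the step size is set to $\Delta t = 1 / N_{seg}$ with $t_v = v \cdot \Delta t$, chosen here because it reproduces the stated condition $N > (k - 1) \cdot N_{seg}$ for reaching all $k$ segments. The function names are hypothetical.

```python
def segment_indices(seq_len, n_seg, k=None):
    """Segment index for each of the seq_len elements of a sequence.

    Assumes t_v = v / n_seg, so floor(t_v) is computed exactly as v // n_seg.
    If k is given, the modulo reset mapping with a = 1 is applied instead of the
    fixed interval mapping.
    """
    indices = [v // n_seg for v in range(seq_len)]
    if k is not None:
        indices = [i % k for i in indices]
    return indices


def num_distinct_segments(seq_len, n_seg, k=None):
    """Number of different N-Curve segments touched by the sequence."""
    return len(set(segment_indices(seq_len, n_seg, k)))
```

Under this assumption, the fixed interval mapping touches a growing number of segments as the sequence length increases, while the modulo reset mapping never touches more than `k`.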


3.3 Summary 


The main contributions of this chapter are twofold. First, a probabilistic ex- 
tension to Bézier curves (N-Curves) was introduced, which models sequences 
of Gaussian probability distributions along a parametric curve. Thereby, an 
N-Curve is defined in terms of stochastic control points. Further, it has 
been shown that N-Curves are a special case of Gaussian processes. Sec- 
ond, a model building on mixtures of N-Curves was presented, which en- 
ables the modeling of multi-modal stochastic processes. Using the N-Curve 
model and its meta-time variant, finite and infinite, as well as discrete-time 
and continuous-time stochastic processes can be modeled. 
This chapter provides an implementation for the N-Curve model (see Section
3.1), based on Mixture Density Networks (MDN). To this end, Section 4.1 first
expands on the introduction of Mixture Density Networks as given in Section
2.2.2 and then proceeds to define N-Curve Mixture Density Networks
(abbrev.: N-MDN). The definition of the N-MDN is accompanied by several
toy examples, exploring the capabilities of the N-Curve model.


In addition to the N-MDN, this chapter provides a proof of concept for the
conceptual extension of the N-Curve model, given by the meta-time N-Curve
model as described in Section 3.2. As such, a recurrent extension of the
N-MDN is introduced and briefly evaluated on multiple toy examples in Section
4.2.


4.1 N-Curve Mixture Density Networks 


Defining a fully regression-based probabilistic sequence model is one of the
main objectives pursued in this thesis. Following this, an MDN for learning
the parameters of an N-Curve mixture (see Section 3.1) from discrete
sequence data is proposed. An MDN is a feed-forward neural network

$$ \Phi(v) = (\{\pi_k, \mu_k, \Sigma_k\}_{k \in \{1, \dots, K\}} \mid v), \tag{4.1} $$

that takes an input vector $v$ and maps it onto the parameters of a
d-dimensional, K-component Gaussian mixture distribution. In order to ensure
that the MDN generates a valid set of mixture parameters, the partitioned
network output

$$ y = (\tilde{\pi}_1, \dots, \tilde{\pi}_K, \tilde{\mu}_1, \dots, \tilde{\mu}_K, \tilde{\sigma}_1, \dots, \tilde{\sigma}_K, \tilde{\rho}_1, \dots, \tilde{\rho}_K), $$

with $\tilde{\pi}_k \in \mathbb{R}$, $\tilde{\mu}_k \in \mathbb{R}^d$, $\tilde{\sigma}_k \in \mathbb{R}^d$ and $\tilde{\rho}_k \in \mathbb{R}^{\frac{d^2 - d}{2}}$,
is further transformed to meet parameter value requirements, i.e.

$$ \pi_k = \mathrm{softmax}(\tilde{\pi}_1, \dots, \tilde{\pi}_K)_k, $$

such that $\sum_k \pi_k = 1$ and

$$
\begin{aligned}
\mu_k &= \tilde{\mu}_k, \\
\sigma_{k,i} &= f_\sigma(\tilde{\sigma}_{k,i}) > 0 \quad \forall i \in \{1, \dots, d\}, \\
\rho_{k,j} &= f_\rho(\tilde{\rho}_{k,j}) \in [-1, 1] \quad \forall j \in \left\{1, \dots, \tfrac{d^2 - d}{2}\right\}.
\end{aligned}
$$

Note that the covariance matrices $\Sigma_k$ are calculated from the standard
deviations and correlations in order to ensure positive definiteness. For the
transformations $f_\sigma$ and $f_\rho$, there are several relevant options to consider. The
original formulation [Bis94] employed

$$ f_\sigma(x) = \exp(x) \quad \text{and} \quad f_\rho(x) = \tanh(x) \tag{4.2} $$


to transform $\sigma$ and $\rho$ into respective value ranges. Both of these functions,
however, can lead the MDN into having numerical issues during training. In
the case of $f_\sigma$, the exponential function yields unstable optimization results for
large input values due to its exponential growth. To cope with this, a shifted
version of the Exponential Linear Unit (abbrev.: ELU, [Cle15, Gui17])

$$ f_\sigma(x) = \mathrm{ELU}(1, x) + 1 \tag{4.3} $$

with

$$ \mathrm{ELU}(\alpha, x) = \begin{cases} \alpha (e^x - 1) & \text{for } x \leq 0 \\ x & \text{for } x > 0 \end{cases} \tag{4.4} $$

and the softplus function (also referred to as SmoothReLU, [Dug01, Glo11,
Iso17])

$$ f_\sigma(x) = \mathrm{softplus}(x) = \ln(1 + e^x) \tag{4.5} $$

are commonly used in MDNs. These functions are similar to the exponential
function for negative and small positive input values, but transition into
a linear function for larger input values. From an optimization point of view,
the softplus function may be preferred over the ELU, as the higher-order
derivatives of the latter are discontinuous at $x = 0$ [Sch20b]. Regarding the
correlations $\rho$, using the tanh function for $f_\rho$ can lead to vanishing gradients.
Thus, the softsign function [Glo10, Iso17]

$$ f_\rho(x) = \mathrm{softsign}(x) = \frac{x}{1 + |x|} \tag{4.6} $$

may be used instead, despite having more complex derivatives [Sza21]. A
schematic for an MDN generating a 2-dimensional Gaussian mixture
distribution is depicted in Figure 4.1.
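
The value mapping applied to the raw network output can be sketched for the 2-dimensional case as follows. This is an illustrative sketch rather than the implementation used in this thesis: the function names, the flat layout of the raw output vector (weights, then means, then standard deviations, then correlations) and the default choice of softplus and softsign are assumptions of this example.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def elu_plus_one(x, alpha=1.0):
    """Shifted ELU, cf. Equations (4.3) and (4.4)."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0)) + 1.0


def softplus(x):
    """ln(1 + e^x), evaluated in a numerically stable way."""
    return np.logaddexp(0.0, x)


def softsign(x):
    """x / (1 + |x|), mapping the reals into (-1, 1)."""
    return x / (1.0 + np.abs(x))


def mdn_parameters(raw, K, f_sigma=softplus, f_rho=softsign):
    """Map a raw output vector of length 6 * K onto 2-D Gaussian mixture parameters."""
    raw = np.asarray(raw, dtype=float)
    pi = softmax(raw[:K])
    mu = raw[K:3 * K].reshape(K, 2)
    sigma = f_sigma(raw[3 * K:5 * K]).reshape(K, 2)
    rho = f_rho(raw[5 * K:6 * K])
    # Build each covariance matrix from standard deviations and correlation.
    off_diag = rho * sigma[:, 0] * sigma[:, 1]
    Sigma = np.empty((K, 2, 2))
    Sigma[:, 0, 0] = sigma[:, 0] ** 2
    Sigma[:, 1, 1] = sigma[:, 1] ** 2
    Sigma[:, 0, 1] = Sigma[:, 1, 0] = off_diag
    return pi, mu, Sigma
```

Because the softsign output lies strictly inside $(-1, 1)$ and the standard deviations are positive, each resulting $2 \times 2$ covariance matrix is positive definite, up to floating-point underflow for extremely negative raw inputs.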
[Schematic with the stages, from input to output: Input v, Neural Network, Linear Readout, Value Mapping.]


Figure 4.1: Schematic of a Mixture Density Network generating a 2-dimensional Gaussian
mixture distribution. The outputs of a (feed-forward) neural network are linearly
transformed and mapped onto the respective parameter value ranges in order to determine
the parameters $\{\pi_k, \mu_k, \Sigma_k\}_{k \in \{1, \dots, K\}}$ of a Gaussian mixture distribution. The
covariance matrices $\Sigma_k$ are given in terms of the standard deviations $\sigma_{k,i}$ and the
correlations $\rho_{k,j}$. For illustration purposes, the mean vector values $\mu_{k,1}$ and $\mu_{k,2}$, as
well as the standard deviations $\sigma_{k,1}$ and $\sigma_{k,2}$ for each mixture component are not
displayed separately.

Following this, Mixture Density Networks can be adapted easily to output the
parameters of a K-component N-Curve mixture with N-Curves of degree
$N_{deg}$ by generating the parameters
$\{P^k\}_{k \in \{1,\dots,K\}} = \{(\mu^k, \Sigma^k)\}_{k \in \{1,\dots,K\}}$ for all
$K \cdot (N_{deg} + 1)$ stochastic control points and the respective K curve weights
$\{\pi_k\}_{k \in \{1,\dots,K\}}$. Using an MDN for learning the N-Curve mixture
parameters, rather than other algorithms (e.g. Expectation Maximization
[Dem77]), has two advantages. First, an MDN allows its output distribution to be
conditioned on arbitrary inputs. Thus, the MDN provides an easy approach to
learn and process conditional N-Curve mixtures, allowing the model to be
used in a conditional inference framework. Second, the MDN can be incorporated
easily into (almost) any neural network architecture without the need
to control the gradient flow. Besides that, there are two notable drawbacks of
MDNs to consider, namely mode collapse due to overfitting and instabilities
during training [Mak19]. In the context of MDNs, mode collapse commonly
refers to the problem that the MDN puts all weight on a single low-variance
component, regardless of the number of available components. While instabilities
during training can be mitigated by choosing appropriate functions
for $f_\sigma$ and $f_\rho$, mode collapse is expected to be reduced by setting MDNs
in the context of N-Curves, as modeling parametric curves instead of single points
is expected to yield more distinct modes.
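The value mapping described above can be sketched as follows for 2-dimensional control points. This is a minimal illustration and not the implementation used in this thesis: the flat output layout, the function name `map_raw_outputs`, and the concrete choices of a softmax for the weights, an exponential for $f_\sigma$ and a hyperbolic tangent for $f_\rho$ are assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax, used to obtain valid mixture weights."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def map_raw_outputs(raw, n_components, n_ctrl):
    """Map a flat vector of raw network outputs onto N-Curve mixture parameters.

    Assumed layout (hypothetical): K weight logits, followed by five values
    (mu_x, mu_y, s_x, s_y, r) for each of the K * n_ctrl control points.
    """
    expected = n_components + 5 * n_components * n_ctrl
    if len(raw) != expected:
        raise ValueError(f"expected {expected} raw outputs, got {len(raw)}")
    weights = softmax(raw[:n_components])
    curves = []
    pos = n_components
    for _ in range(n_components):
        ctrl = []
        for _ in range(n_ctrl):
            mu_x, mu_y, s_x, s_y, r = raw[pos:pos + 5]
            pos += 5
            ctrl.append({
                "mean": (mu_x, mu_y),                    # unconstrained
                "sigma": (math.exp(s_x), math.exp(s_y)),  # assumed f_sigma: exp, keeps sigma > 0
                "rho": math.tanh(r),                     # assumed f_rho: tanh, keeps rho in (-1, 1)
            })
        curves.append(ctrl)
    return weights, curves
```

For K = 2 components of degree 4 (5 control points each), the network would have to produce 2 + 5 * 2 * 5 = 52 raw outputs under this layout.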


The most commonly used loss function for training an MDN is the negative
log-likelihood [Bis94], which can also be adapted for training the N-Curve
Mixture Density Network from discrete sequence data. Let
$\mathcal{S} = \{S_1, \dots, S_M\}$ be a set of M realizations of a stochastic process
with $S_j = \{x_1^j, \dots, x_N^j\}$, where each $x_i^j$ with $i \in \{1, \dots, N\}$ is a
sample value for the respective random variable $X_{t_i}$ at time i for $t_i \in T_N$
(see Section 3.1). In order to simplify the training procedure, Gaussian random
variables along the N-Curve are treated as if they were independent. Following this,
the joint probability of the samples $x_i^j$ in a sequence $S_j$ along an N-Curve
$\psi$ factorizes into the unnormalized Gaussian density

\[
p^{\psi}(S_j) = p^{\psi}(x_1^j, \dots, x_N^j) = \prod_{i=1}^{N} p_{t_i}^{\psi}(x_i^j). \tag{4.7}
\]


This is exploited when defining the loss function. It should be noted that
this simplification can be justified, as the correlation between these Gaussian
random variables is enforced implicitly by the underlying N-Curve and thus
by the stochastic control points, which are estimated during training.




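For a single sequence $S_j$ and an N-Curve $\psi = \Phi(v)$, the loss function is then
defined by the negative (unnormalized) log-likelihood

\[
\begin{aligned}
\mathcal{L} &= -\log p^{\psi}(x_1^j, \dots, x_N^j) \\
&= -\log \left( \prod_{i=1}^{N} p_{t_i}^{\psi}(x_i^j) \right) \\
&= -\sum_{i=1}^{N} \log p_{t_i}^{\psi}(x_i^j) \\
&= -\sum_{i=1}^{N} \log p\left(x_i^j \mid \mu^{\psi}(t_i), \Sigma^{\psi}(t_i)\right)
\end{aligned} \tag{4.8}
\]

of the sequence given an input vector v. Therefore, the loss for a set of M
sequences $\mathcal{S} = \{S_1, \dots, S_M\}$ is defined as

\[
\mathcal{L} = \frac{1}{M} \sum_{j=1}^{M} \mathcal{L}_j
= \frac{1}{M} \sum_{j=1}^{M} \left( -\sum_{i=1}^{N} \log p\left(x_i^j \mid \mu^{\psi}(t_i), \Sigma^{\psi}(t_i)\right) \right). \tag{4.9}
\]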

Equation (4.9) can easily be extended for N-Curve mixtures. Given an N-Curve
mixture $\Psi = \Phi(v)$, the likelihood of a single training sequence $S_j$ is
now calculated as the weighted linear combination of the likelihood of $S_j$ for
each $\psi_k$ (see Equation (4.7)):

\[
p^{\Psi}(S_j) = \sum_{k=1}^{K} \pi_k \, p^{\psi_k}(S_j). \tag{4.10}
\]




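Thus, the loss for a set $\mathcal{S}$ of M sequences can be defined as

\[
\begin{aligned}
\mathcal{L} &= \frac{1}{M} \sum_{j=1}^{M} -\log \sum_{k=1}^{K} \pi_k \, p^{\psi_k}(S_j) \\
&= \frac{1}{M} \sum_{j=1}^{M} -\log \sum_{k=1}^{K} \pi_k \prod_{i=1}^{N} p_{t_i}^{\psi_k}(x_i^j) \\
&= \frac{1}{M} \sum_{j=1}^{M} -\log \sum_{k=1}^{K} \exp \left( \log \pi_k + \sum_{i=1}^{N} \log p_{t_i}^{\psi_k}(x_i^j) \right).
\end{aligned} \tag{4.11}
\]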


Then, the N-Curve Mixture Density Network (abbrev.: N-MDN) can
be trained using a standard gradient descent policy. Most commonly,
momentum-based gradient descent optimizers are employed. Popular choices
include Adam [Kin15] and RMSprop [Rud16]. It should be noted that, from an
optimization point of view, it is preferred to use the mean of the negative
log-likelihoods when long sequences or many samples are processed, as the sum of
negative log-likelihoods may result in large loss values and thus a less stable
optimization. Further, it is recommended to output the mean vectors relative
to the last element of the input sequence instead of their absolute values.
That way, the N-MDN learns a residual mapping, which has proven to be
easier to optimize and to yield more accurate results [He16][Hug17]. Finally,
the loss function $\mathcal{L}$ given in Equation (4.11) is arranged in such a way that
the log-sum-exp trick [Pre07] can be applied. This trick prevents arithmetic
underflow by offsetting the values in the exponent, according to

\[
\log \sum_{i} \exp(z_i) = \log \sum_{i} \exp(z_i - z_{max}) + z_{max}.
\]
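The loss in Equation (4.11), including the log-sum-exp trick, can be sketched in plain Python for 1-dimensional data. This is an illustrative sketch, not the PyTorch implementation used for the experiments. It assumes that the Gaussian at curve position t has the mean $\sum_i b_i(t)\,\mu_i$ and the variance $\sum_i b_i(t)^2\,\sigma_i^2$ for independent Gaussian control points and Bernstein polynomials $b_i$; all function names are hypothetical.

```python
import math

def bernstein(n, i, t):
    """Bernstein polynomial b_{i,n}(t)."""
    return math.comb(n, i) * t**i * (1.0 - t)**(n - i)

def curve_gaussian(ctrl, t):
    """Mean and variance at curve position t.

    ctrl is a list of (mean, std) pairs, one per control point. Control
    points are assumed to be independent (assumption of this sketch).
    """
    n = len(ctrl) - 1
    mean = sum(bernstein(n, i, t) * m for i, (m, _) in enumerate(ctrl))
    var = sum(bernstein(n, i, t) ** 2 * s**2 for i, (_, s) in enumerate(ctrl))
    return mean, var

def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(values):
    """log(sum(exp(v))) computed with the max-offset trick."""
    z_max = max(values)
    return z_max + math.log(sum(math.exp(v - z_max) for v in values))

def mixture_nll(sequences, ts, weights, curves):
    """Mean negative log-likelihood of the sequences, following Eq. (4.11)."""
    total = 0.0
    for seq in sequences:
        terms = []
        for w, ctrl in zip(weights, curves):
            # log pi_k + sum_i log p(x_i | mu(t_i), sigma^2(t_i)); requires w > 0
            log_lik = sum(log_normal_pdf(x, *curve_gaussian(ctrl, t))
                          for x, t in zip(seq, ts))
            terms.append(math.log(w) + log_lik)
        total -= logsumexp(terms)
    return total / len(sequences)
```

Without the offset, a call such as `math.log(math.exp(-1000.0) + math.exp(-1000.0))` fails because both exponentials underflow to zero, whereas `logsumexp([-1000.0, -1000.0])` stays finite.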


In this implementation of the N-Curve (mixture) model, the input vector v
allows the model to be used in either a conditional or a non-conditional
inference setting. Examples for both settings include sequence prediction
(conditional) and the estimation of the data generating distribution given some
dataset (non-conditional). Regarding the conditional case, the stochastic
process, and thus the N-Curve mixture, depends on some input sequence S. This
sequence may be encoded into the input vector v using some encoder Enc(S).
Common choices for sequence encoders are recurrent neural networks and their
variants, such as the Long Short-Term Memory (LSTM) and the Gated Recurrent
Unit (GRU). In the non-conditional case, v can be set to some constant
value. While, technically, this also gives a conditional N-Curve mixture, there
is no variation in v, resulting in a constant mixture.


Subsections 4.1.1 - 4.1.4 provide several toy examples using synthetically gen- 
erated data to showcase different features and the functionality of the pro- 
posed implementation of the N-Curve model, based on Mixture Density Net- 
works. In order to remove as much complexity as possible, the input vector 
v is set to be constant, thus creating a non-conditional sequence synthesis 
setting. Following this, the N-MDN learns to generate an N-Curve mixture, 
which estimates the underlying stochastic process generating the provided 
data. For all of the toy examples, a PyTorch [Pas19] implementation of the 
N-MDN is used. The model is trained using the Adam optimizer with the 
learning rate set to 0.01. All other parameters are left at PyTorch defaults. 
Because the low-dimensional sequence datasets are small, there is no need
to perform mini-batch optimization, as the entire dataset fits into memory. As such,
the entire training dataset is processed in each training iteration. The model
is assumed to have reached convergence when the training loss stagnates
for 10 iterations.
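The stopping criterion can be made concrete with a small helper. The sketch below is only one possible reading of "stagnates for 10 iterations": the tolerance `tol` and the function name `has_converged` are assumptions and are not specified in the text.

```python
def has_converged(losses, patience=10, tol=1e-4):
    """Return True if the loss did not improve by more than `tol`
    during the last `patience` iterations.

    `losses` holds one training loss value per iteration (full-batch).
    """
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_before - best_recent < tol
```

In a full-batch training loop, the loss of each iteration would be appended to `losses`, and training would stop as soon as `has_converged(losses)` returns `True`.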


4.1.1 Estimating N-Curve mixtures from noisy data


To test the N-Curve Mixture Density Network's capability of learning
the parameters of an N-Curve mixture from noisy sequence data, a simple
experiment is conducted. To enable a proper visualization of the results, this
example uses 2-dimensional data. The experiment is set up as follows:


1 Define an arbitrary 2-component N-Curve mixture $\Psi_{gt}$ with 5
Gaussian control points per component by defining the weights $\pi_k$ as
well as the mean vectors $\mu_i^k$ and covariance matrices $\Sigma_i^k$ for the
control points $P_i^k$ of each N-Curve $\psi_k$. The weights are set to
$\pi = \{0.75, 0.25\}$, resulting in a biased training dataset.


2 Draw a set of M = 1000 Bézier curves from $\Psi_{gt}$.


3 Determine the discrete index set $T_{N=20}$ consisting of 20 (arbitrary, but
fixed) evenly distributed values for $t \in [0, 1]$ (see also Section 3.1) in
order to discretize each of the M Bézier curves into a set $\mathcal{S}$ of sample
sequences.


4 Apply Gaussian noise to each sample sequence in order to create a
more realistic training dataset.


5 Train the N-MDN using $\mathcal{S}$ and check if the network is capable of
reconstructing $\Psi_{gt}$.
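Steps 1 to 4 of this setup can be sketched as follows. The ground truth parameters in `components`, the noise level, and the restriction to isotropic control point covariances are hypothetical choices made for this example; they are not the values used to produce the figures in this section.

```python
import math
import random

def bezier_point(ctrl, t):
    """Evaluate a 2-D Bezier curve with control points `ctrl` at t in [0, 1]."""
    n = len(ctrl) - 1
    x = y = 0.0
    for i, (px, py) in enumerate(ctrl):
        b = math.comb(n, i) * t**i * (1.0 - t)**(n - i)
        x += b * px
        y += b * py
    return x, y

def sample_sequence(weights, components, ts, noise_std, rng):
    # Step 2: pick a mixture component and draw one Bezier curve from it
    k = rng.choices(range(len(weights)), weights=weights)[0]
    ctrl = [(rng.gauss(mx, s), rng.gauss(my, s)) for mx, my, s in components[k]]
    # Steps 3 and 4: discretize the curve and add Gaussian noise
    seq = []
    for t in ts:
        x, y = bezier_point(ctrl, t)
        seq.append((x + rng.gauss(0.0, noise_std), y + rng.gauss(0.0, noise_std)))
    return k, seq

# Step 1 (hypothetical values): 2 components with 5 control points each,
# every control point given as (mean_x, mean_y, isotropic std)
components = [
    [(0.0, 0.0, 0.1), (1.0, 1.0, 0.2), (2.0, 1.5, 0.2), (3.0, 1.0, 0.2), (4.0, 0.0, 0.3)],
    [(0.0, 0.0, 0.1), (1.0, -1.0, 0.2), (2.0, -1.5, 0.2), (3.0, -1.0, 0.2), (4.0, 0.0, 0.3)],
]
weights = [0.75, 0.25]
ts = [i / 19 for i in range(20)]  # 20 evenly distributed values in [0, 1]

rng = random.Random(0)
samples = [sample_sequence(weights, components, ts, 0.05, rng) for _ in range(1000)]
dataset = [seq for _, seq in samples]  # M = 1000 noisy sequences of length 20
```

Because the weights are unequal, roughly three quarters of the generated sequences originate from the first component, which produces the biased dataset mentioned in step 1.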


In this way, the stochastic control points defining the N-Curve mixture have
to be estimated indirectly through a set of sample sequences. The ground
truth N-Curve mixture $\Psi_{gt}$ and a sample sequence are depicted in Figure 4.2.

(a) Sample Bézier curve.

(b) Discretized curve sample with added noise.


Figure 4.2: Ground truth N-Curves $\psi_1$ and $\psi_2$ (red and green) starting at (0,0)
alongside a sampled Bézier curve (blue). Both (a) and (b) show the mean curve and the
control points with covariance ellipses for both N-Curves. In (a), the Bézier curve is
illustrated as sampled from one of the two N-Curves, while (b) shows the discretized
version of the sample curve with Gaussian noise applied.

The training dataset $\mathcal{S}$ and the evolution of the parameters defining the
estimated N-Curve mixture $\Psi_{pred}$ over several training iterations are depicted in
Figure 4.3. In order to make the quality of the estimation more comprehensible,
Figures 4.3c - 4.3f show the deviation of the estimated from the actual
parameters. The deviations are defined per control point $P_i^k$ as

\[
\begin{aligned}
\Delta\mu &= \| \mu_{pred,i}^k - \mu_{gt,i}^k \|_2, \\
\Delta\sigma &= \sigma_{pred,i}^k - \sigma_{gt,i}^k, \\
\Delta\rho &= \rho_{pred,i}^k - \rho_{gt,i}^k.
\end{aligned} \tag{4.12}
\]

Figure 4.3: Training dataset (a) and estimated mixture parameters over the course of the
training ((b) - (f)). (c): Euclidean distance between the estimated mean vectors and the
ground truth. (d) - (f): Signed difference between estimated and ground truth standard
deviations and correlations, revealing the occurrence of under- and over-estimation. In
(c) - (f), values corresponding to the first and second mixture component are shown
as a solid and dashed line, respectively. The depicted colored lines correspond to the
control points $\{P_0, P_1, P_2, P_3, P_4\}$ of each N-Curve in the mixture.




Looking at Figures 4.3b and 4.3c, it can be seen that the N-MDN is well capable
of estimating the actual component weights and control point mean vectors.
On the other hand, the model seems to over-estimate the standard deviations
slightly ($\Delta\sigma \approx 0.1$) and there appears to be a rather large discrepancy
between the estimated and actual correlation values. A possible explanation for these
discrepancies can be found when looking at the covariance matrix from a geometric
point of view. In general, the covariance matrix not only controls the
amount of dispersion in data drawn from a corresponding Gaussian distribution,
but also the orientation of the principal axes of dispersion. As such, the
covariance matrix can be interpreted as a linear transformation defined by a
rotation matrix R and a diagonal scaling matrix S, such that $\Sigma = R S S R^{-1}$*.
In the 2-dimensional case, the data dispersion can be visualized by a (rotated)
ellipse. The orientation of this ellipse is controlled by the covariance
matrix off-diagonal element $\Sigma_{2,1} = \Sigma_{1,2} = \rho\sigma_1\sigma_2$. Following
this, when either $\sigma$ is over-estimated, the error in ellipse orientation can be
compensated by adapting the correlation $\rho$. Besides that, there likely exist
multiple similar solutions to the covariance matrix, defined by different
correlation values, which generate a similar set of samples. As such, the N-MDN
only finds a locally optimal solution.
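The geometric interpretation can be illustrated for the 2-dimensional case with a closed-form eigendecomposition. The following sketch is for illustration only; the function names are hypothetical. It computes the orientation and the semi-axes of the covariance ellipse from the two standard deviations and the correlation, and rebuilds the covariance matrix from the rotation and scaling.

```python
import math

def covariance_ellipse(sigma_1, sigma_2, rho):
    """Orientation (radians) and semi-axes of the 1-sigma covariance ellipse."""
    a = sigma_1**2
    c = sigma_2**2
    b = rho * sigma_1 * sigma_2              # off-diagonal element
    center = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    angle = 0.5 * math.atan2(2.0 * b, a - c)  # direction of the major axis
    major = math.sqrt(center + radius)        # sqrt of the larger eigenvalue
    minor = math.sqrt(max(center - radius, 0.0))
    return angle, major, minor

def rebuild_covariance(angle, major, minor):
    """Entries (S11, S12, S22) of R S S R^T with R = rotation(angle), S = diag(major, minor)."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    s11 = cos_a**2 * major**2 + sin_a**2 * minor**2
    s22 = sin_a**2 * major**2 + cos_a**2 * minor**2
    s12 = cos_a * sin_a * (major**2 - minor**2)
    return s11, s12, s22
```

Since the orientation depends on the ratio between $2\rho\sigma_1\sigma_2$ and $\sigma_1^2 - \sigma_2^2$, a change in one of the standard deviations can be offset by a change in $\rho$ that leaves the orientation almost unchanged.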


Finally, Figure 4.4 depicts $\Psi_{pred}$ at different stages during training,
illustrating the process of estimating $\Psi_{gt}$. It can be seen that the positions of
the control points are estimated well. Further, the orientation preservation
assumption can be confirmed by looking at Figure 4.4d. The orientation of the
estimated covariance ellipses is similar to that of the real ellipses, but the
variances are slightly over-estimated.


* The transformation matrices can be obtained by an Eigendecomposition. 

(c) Iteration 5000. (d) Iteration 10000.


Figure 4.4: Ground truth N-Curve mixture (red and green) and mixture estimated from noisy
sequence data (blue and purple) after 0 (random initialization), 1000, 5000 and 10000
iterations of training.


4.1.2 Handling heteroscedastic data 


This example examines the capabilities of the N-Curve model in handling
heteroscedastic data, i.e. stochastic processes with varying variance between
time steps. Using heteroscedastic data especially allows the examination of
the modeling accuracy when varying the number of N-Curve control points.
In contrast to the previous toy example, this and further examples are only
concerned with curve points along the estimated N-Curve (mixture),
representing the actual stochastic process.

In order to keep the experiment as simple as possible, the dimensionality of
the data is set to 1. Thus, a dataset consisting of 1000 sample sequences is
generated using a unimodal, time-discrete stochastic process
$\mathcal{G}_T = \{X_t \sim \mathcal{N}(\mu_t, \sigma_t)\}_{t \in T_N}$, with mean values
moving along a parabolic curve and corresponding standard deviations changing between
consecutive process indices $t_i$ and $t_{i+1}$. The stochastic process alongside
sampled process realizations is given in Figure 4.5.


Figure 4.5: Illustration of the training samples drawn from a heteroscedastic stochastic
process. (a): Ground truth discrete-time stochastic process
$\mathcal{G}_T = \{X_t\}_{t \in T_N}$. Standard deviations ($\sigma$, $2\sigma$ and
$3\sigma$) along $\mathcal{G}_T$ are illustrated as a shaded region around the
mean curve. $3\sigma$ for each $X_t$ is indicated by a horizontal dashed line. (b): Sample
sequences drawn from $\mathcal{G}_T$.


Estimated N-Curves with 5 and 15 control points generated by an N-MDN
are depicted in Figure 4.6. It can be seen that the N-MDN learns a smooth
mean curve and compensates for variation in noise using the variance of the
control points. With an increasing number of control points, an increasing
number of variations in input noise can be compensated. This, however, comes
at the cost of a less accurate mean curve, due to the increasing degree of the
underlying polynomial curve. Note that the N-Curves still model a
time-continuous stochastic process, despite being given discrete data.
Intermediate values are interpolated according to the Bernstein polynomials (see also
Section 3.1).

Figure 4.6: Approximations of $\mathcal{G}_T$ using N-Curves with 5 (a) and 15 (b) control
points as generated by an N-MDN trained with noisy sequence data. Using more control
points increases the accuracy in modeling the variance of the stochastic process at the
cost of a less accurate mean curve. For reference, the training sequences (without
connections between subsequent points) are illustrated using black cross markers.


4.1.3 Presence of superfluous mixture components 


When dealing with multi-modal sequence data, the actual number of modes
is usually unknown. Following that, this toy example is concerned with the
impact of superfluous N-Curve mixture components on the N-MDN training,
as well as the resulting mixture model. For this experiment, a bimodal
discrete-time stochastic process $\mathcal{G}_T = \{X_t\}_{t \in T_N}$ with constant
variance is defined for generating a training dataset. Here, each random variable of the
process follows a bimodal Gaussian mixture distribution and realizations of
the process follow one of two paths with equal probability. The stochastic
process alongside process realizations is depicted in Figure 4.7.

Figure 4.7: Illustration of the training samples drawn from a multi-modal stochastic
process. (a): Ground truth discrete-time stochastic process
$\mathcal{G}_T = \{X_t\}_{t \in T_N}$. Standard deviations ($\sigma$, $2\sigma$ and
$3\sigma$) along $\mathcal{G}_T$ are illustrated as a shaded region around the
mean curves. $3\sigma$ for each $X_t$ is indicated by a horizontal dashed line. (b): Sample
sequences for both modes drawn from $\mathcal{G}_T$.


A training dataset consisting of a set of 100 realizations of the aforementioned
stochastic process is now used to estimate an N-Curve mixture with
K = 6 components, i.e. 4 superfluous components, with 5 control points each.
Preferably, in the resulting model, the weights of all 4 unnecessary components
are driven towards 0 and the remaining N-Curves model the two modes
of the stochastic process. The resulting N-Curve mixture components after
training the N-MDN are depicted in Figure 4.8. The components are ordered
in descending order by their associated mixture weight $\pi_k$.

Figure 4.8: Estimated N-Curve mixture with K = 6 components generated by an N-MDN.
Two components ((a) and (b)) represent $\mathcal{G}_T$ with accurate weighting. The weights
of superfluous components are driven towards 0 during training. The shape of these
components is thus not further optimized at some point (see figures (c) - (f)).


Looking at the estimated N-Curve mixture, the N-MDN behaves as expected.
During training, it learns the weight distribution rather fast, so that the shape
of superfluous components is not optimized any further. This can be
seen in Figure 4.9a. The remaining non-zero components (4.8a and 4.8b)
accurately model $\mathcal{G}_T$ with minor over-estimation of the variances. Stripping
away the superfluous components, the resulting N-Curve mixture is depicted in
Figure 4.9b.

Figure 4.9: Evolution of the weight distribution during training (a) and the resulting
approximation of $\mathcal{G}_T$ after removing low-weight components (b). The coloring in
(a) matches the components depicted in the previous figure (4.8). For reference, the
training sequences (without connections between subsequent points) are illustrated using
black cross markers in figure (b).


As a final note, purposefully choosing $K > K_{real}$ and relying on the optimization
to drive the weights of superfluous components towards 0 might lead to several
similar non-zero components when processing more complex datasets. In this
case, there are several ways to cope with this when required. The first
and most straightforward approach is to implement a post-processing step,
which collapses similar components into one by accumulating their weights
and averaging the curves. With respect to the training phase, proper
regularization could be employed in order to enforce sparsity. The most commonly
used sparsity-inducing regularization is given by the $L_1$ norm [Ng04] applied
to the mixture weight distribution. Lastly, the determination of K itself can
be approached from a different perspective by implementing the idea
of the Infinite Gaussian Mixture Model [Ras99]. The basic idea of the infinite
GMM is to apply a Bayesian modeling approach to the mixture
parameters. As such, the mixture weights are modeled using a Dirichlet prior
distribution. While this approach removes the problem of choosing an appropriate
value for K, it also introduces more complexity into the N-MDN and
its training. Additionally, estimating the parameters of an infinite GMM usually
involves Monte Carlo methods or variational inference, so that such an
extension of the N-MDN would run counter to the intention of designing a
regression-based probabilistic sequence model.
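The post-processing step mentioned first could look like the following sketch. It is not part of the presented implementation. Components are reduced to the 1-dimensional mean values of their control points, similarity is measured by the largest distance between corresponding control point means, and curves are averaged using their weights; the similarity measure, the `threshold` and the weighted averaging are assumptions of this example.

```python
def merge_similar(weights, curves, threshold):
    """Collapse similar mixture components into one.

    weights:   list of mixture weights
    curves:    list of control point mean lists (1-D), one per component
    threshold: maximum control point distance for two components to be merged
    """
    merged_weights, merged_curves = [], []
    for w, curve in zip(weights, curves):
        for idx, ref in enumerate(merged_curves):
            if max(abs(a - b) for a, b in zip(curve, ref)) < threshold:
                total = merged_weights[idx] + w
                if total > 0.0:
                    # weighted average of the two curves
                    merged_curves[idx] = [
                        (merged_weights[idx] * r + w * c) / total
                        for r, c in zip(ref, curve)
                    ]
                merged_weights[idx] = total  # accumulate the weights
                break
        else:
            merged_weights.append(w)
            merged_curves.append(list(curve))
    return merged_weights, merged_curves
```

The result depends on the order of the components, so sorting them by weight in descending order before merging is a reasonable convention.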


4.1.4 Comparison with SMC inference 


One design choice for the N-Curve model is to move multi-step inference
into the training phase, thus allowing for an instant prediction of several time
steps. This is opposed to Sequential Monte Carlo (SMC) approaches, which
model the transition between subsequent time steps and perform iterative
inference. In this experiment, the performance of the N-MDN implementation
of the N-Curve model is compared to an exemplary SMC approach. For this
comparison, an LSTM-MDN model embedded into a particle filtering cycle
[Hug18], denoted as ParticleLSTM in the following, is used for generating an
approximation of a discrete-time stochastic process. As the ParticleLSTM
expects discrete inputs and outputs a Gaussian mixture probability distribution,
a new set of samples, also called particles, needs to be drawn from the
mixture distribution after each inference step. This is comparable to the
re-sampling step of particle filters [Dou09] and serves two purposes. First and
foremost, this approach keeps the number of particles constant, avoiding the
exponential growth of particles that a brute force approach would cause. Second,
it enables the propagation of a sample-based representation of a probability
distribution through time using an LSTM-MDN.


For comparing the performance of the N-MDN and the ParticleLSTM in
approximating a time-discrete stochastic process $\mathcal{G}_T$ from noisy sequence data,
a training dataset is sampled from $\mathcal{G}_T$. In order to keep this experiment clear
and easier to evaluate, $\mathcal{G}_T = \{X_t\}_{t \in T_{N=5}}$ is defined to be a unimodal
stochastic process with finite index set $T_{N=5} = \{0, 0.25, 0.5, 0.75, 1\}$. The
stochastic process and a training dataset sampled from $\mathcal{G}_T$ are depicted in
Figure 4.10.

Figure 4.10: Illustration of the training dataset for comparing the N-MDN with an
exemplary SMC approach. (a): Ground truth discrete-time stochastic process
$\mathcal{G}_T = \{X_t\}_{t \in T_{N=5}}$ alongside sample sequences. (b): Training data
sampled from $\mathcal{G}_T$. Sequential connections are left out in order to provide a
cleaner illustration.


Using this training dataset, both models are trained until convergence is
reached. In this experiment, the N-MDN generates a 1-component N-Curve
mixture with 3 control points and the ParticleLSTM uses $N_p$ particles for
generating its prediction. An approximation of $\mathcal{G}_T$, given v = 0 as a constant
input, is then generated as follows: In case of the N-MDN, a single pass
through the network yields the parameters of an N-Curve. Accessing this
curve at $t \in \{0, 0.25, 0.5, 0.75, 1\} =: T_{N=5}$ gives the random variables
$\{\hat{X}_t\}_{t \in T_{N=5}}$ approximating $\mathcal{G}_T$. For the ParticleLSTM, passing
$N_p$ copies of v through the network generates a Gaussian mixture distribution
$\hat{X}_1$ approximating $X_1$. Next, $N_p$ samples are taken from $\hat{X}_1$ and fed
into the network again in order to approximate $X_2$. This process is repeated for
retrieving $X_3$, $X_4$ and $X_5$. The resulting approximations of $\mathcal{G}_T$ are
depicted in Figure 4.11.
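The iterative procedure used by the SMC approach can be sketched as a simple loop. This is a strongly simplified stand-in for the ParticleLSTM: `step_model` is a hypothetical one-step model that maps a 1-dimensional particle to the parameters of a Gaussian mixture, the recurrent state of the LSTM is omitted, and each particle is re-sampled from its own predicted mixture instead of from an aggregated mixture.

```python
import random

def propagate_particles(step_model, start, n_particles, n_steps, rng):
    """Propagate a particle set through `n_steps` one-step predictions.

    step_model(x) must return (weights, means, stds) of a 1-D Gaussian mixture
    describing the next time step given the current value x.
    """
    particles = [start] * n_particles
    history = []
    for _ in range(n_steps):
        new_particles = []
        for x in particles:
            weights, means, stds = step_model(x)
            # re-sampling: draw one new particle, keeping the set size constant
            k = rng.choices(range(len(weights)), weights=weights)[0]
            new_particles.append(rng.gauss(means[k], stds[k]))
        particles = new_particles
        history.append(particles)
    return history
```

Each entry of the returned history is a sample-based approximation of one random variable of the process, so mean and variance estimates for every time step require `n_steps` sequential passes. The N-MDN, in contrast, provides all time steps after a single pass.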

Figure 4.11: Resulting approximations of the stochastic process $\mathcal{G}_T$. Both
approaches yield similar results with slight differences in the variances, especially for
$X_3$ and $X_4$.


It can be seen that both approaches lead to similar results. The most notable
differences can be observed when looking at the variances, where the N-MDN
yields slightly less accurate results, which can be attributed to the use of a
compact representation with only 3 control points. The difference in variances
is most visible for $X_3$ and $X_4$, where the non-linear weighting of control
point covariances yields a minor under-estimation of the actual variance values
(see also Section 3.1.2). On the other hand, the mean curve generated by
the ParticleLSTM is less stable, which is most likely due to the uncontrolled
and stochastic nature of the iterative approach.


While the comparison of the resulting approximations confirms the viability
of the N-Curve model as an alternative to SMC approaches, additional aspects
should be considered. By design, the N-Curve model moves the task
of multi-step inference into the training phase, thus eliminating the need for
Monte Carlo simulation during inference. Following this, Figure 4.12 depicts
the differences in training time (4.12a), inference time (4.12b), memory usage
(4.12c) and accuracy (4.12d) in order to reveal the impact of this design choice
on these aspects. In order to provide a measure for the approximation accuracy,
the error E in terms of the Euclidean distance of vectors combining
mean and standard deviation values is averaged over all time steps, i.e.

\[
E = \frac{1}{5} \sum_{t=1}^{5} \left\| (\hat{\mu}_{t,x} \; \hat{\mu}_{t,y} \; \hat{\sigma}_{t,x} \; \hat{\sigma}_{t,y})^{\top} - (\mu_{t,x} \; \mu_{t,y} \; \sigma_{t,x} \; \sigma_{t,y})^{\top} \right\|_2. \tag{4.13}
\]


Here, $\hat{\cdot}$ represents estimated values. In this formulation, the error function
aggregates the square differences of each factor. Although this approach ignores
the actual semantics of the mean and variance values, it should provide
a viable estimate for the accuracy of all approximated factors, due to their
common minimum error value and comparable squared value ranges. Note
that for this comparison, the reference values $\mu_{t,x}$, $\mu_{t,y}$, $\sigma_{t,x}$ and
$\sigma_{t,y}$ are calculated from the training dataset, as these are likely to differ
slightly from the actual ground truth values due to the training data being sampled
randomly. Further, for the ParticleLSTM, the statistics are provided for an increasing
number of particles used for inference. Training and inference of each configuration
are performed 10 times in order to generate more reliable results.
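A direct transcription of the error measure into Python could look as follows. The function name and the data layout (one 4-tuple of mean and standard deviation values per time step) are assumptions for this sketch.

```python
import math

def approximation_error(estimated, reference):
    """Average Euclidean distance between estimated and reference statistics.

    Both arguments are lists with one tuple (mu_x, mu_y, sigma_x, sigma_y)
    per time step, following Equation (4.13).
    """
    if len(estimated) != len(reference):
        raise ValueError("both inputs need one entry per time step")
    distances = [math.dist(e, r) for e, r in zip(estimated, reference)]
    return sum(distances) / len(distances)
```

For the process used in this comparison, both lists would contain five entries, one for each index in $T_{N=5}$.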

Figure 4.12: Comparison of the N-MDN and the ParticleLSTM focusing on different aspects
related to training and inference when increasing the number of particles used for
inference. Inference statistics are provided for 10 repetitions in order to show the
consistency of the SMC approach. In figures (b) - (d), the red line indicates the
N-MDN baseline to which the ParticleLSTM is compared. In figures (a) and (d), the green
diamond markers indicate the mean values. The error in (d) is given by the average
deviation of the estimated mean and standard deviation values from a reference
provided by the training data.


Looking at Figure 4.12, the impact on the depicted factors is as expected.
Due to moving the multi-step inference into the training phase, the N-MDN requires
more iterations to reach convergence than the ParticleLSTM,
which only needs to learn the transition between subsequent time steps (see
4.12a). On the other hand, multi-step inference is achieved using a single
pass through the N-MDN, resulting in faster inference. The time required to
approximate $\mathcal{G}_T$ using the ParticleLSTM scales linearly with the number of
particles $N_p$ (see 4.12b). At the same time, the memory usage grows with
increasing $N_p$. According to Figure 4.12c, memory usage even grows superlinearly.
This is due to the heavy computations being implemented to run on the GPU, which
requires vectorization and, in turn, increases the memory demand, especially
for parallelized particle re-sampling. As expected, Figure 4.12d shows that the
accuracy of the approximation increases with higher $N_p$, surpassing the accuracy
of the N-MDN approximation at some point. Ultimately, it depends
on the specific use case whether faster but slightly less accurate inference or
slower but more accurate inference is more important.


4.2 Proof of Concept: Recurrent N-Curve 
Mixture Density Networks 


In order to provide a proof of concept for the conceptual extension of the
N-Curve model, the meta-time N-Curve model (see Section 3.2), an approach
capable of generating a steady stream of N-Curve segments is required.
Additionally, dependencies between subsequent segments along a
generated probabilistic spline need to be taken into account, especially in the
$C^1$ or $C^2$ continuity cases. Following this, an autoregressive approach is
well-suited for this implementation. With combinations of recurrent neural
networks and Mixture Density Networks being state-of-the-art sequence models
(see also Section 2.2.2), an LSTM network is combined with the N-MDN
(see Section 4.1) for this proof of concept. The resulting model is denoted as
recurrent N-MDN. The recurrent N-MDN operates on the meta-timeline and
targets the generation of an endpoint-connected probabilistic spline. This
restricts the timeline mappings presented in Section 3.2.2 to special cases
without overlapping curve segments. A schematic of the model architecture is
given in Figure 4.13.

Figure 4.13: Schematic of the recurrent N-MDN illustrating the architecture with a loop
(left side) and unrolled over N steps of meta-time (right side). The model outputs an
N-Curve segment at each meta-time step.


As illustrated, the same model input v is used in each meta-time step. The
reason for this is that feeding generated N-Curve segments
back into the model would require a way of encoding an N-Curve into v.
Instead, by using constant input, the model relies on its recurrent connection
for evolving its output over time. As a technical detail, the model is designed
to generate a stream of residuals, i.e. Gaussian control point mean vectors are
always given as offsets to preceding control points. This has multiple advantages.
First, as mentioned in Section 4.1, using residuals instead of absolute
values is more stable during training and inference, as the target domain is
more restricted. Second, by defining segment control points in terms of previous
control points, it is easier to take geometric restrictions into account for
enforcing $C^1$ or $C^2$ continuity as required. Regarding the loss function
employed during training of the recurrent N-MDN, the negative log-likelihood
as defined in Equation (4.11) in Section 4.1 can be directly translated to this
extension. This is due to the extraction of sequences from a meta-time N-Curve
working basically in the same way as for N-Curves.
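The residual formulation can be illustrated for the mean vectors of 2-dimensional control points. The sketch below is an assumption-laden illustration and not the implementation used here: it assumes endpoint-connected segments of equal degree on equally long parameter intervals, for which the standard Bézier condition for $C^1$ continuity places the second control point of a segment by mirroring the second-to-last control point of the preceding segment at the shared endpoint. The function name `next_segment` is hypothetical.

```python
def next_segment(prev_ctrl, offsets, continuity=0):
    """Build the control point means of the next segment from residuals.

    prev_ctrl:  control point means (x, y) of the preceding segment
    offsets:    residuals for the free control points, each given relative
                to the control point before it
    continuity: 0 for an endpoint-connected (C^0) spline, 1 to enforce C^1
    """
    last = prev_ctrl[-1]
    ctrl = [last]  # shared endpoint guarantees C^0 continuity
    if continuity >= 1:
        before = prev_ctrl[-2]
        # mirror the previous control point at the shared endpoint (C^1)
        ctrl.append((2.0 * last[0] - before[0], 2.0 * last[1] - before[1]))
    for dx, dy in offsets:
        px, py = ctrl[-1]
        ctrl.append((px + dx, py + dy))
    return ctrl
```

With 5 control points per segment, the network has to provide 4 residuals per segment in the $C^0$ case and only 3 in the $C^1$ case, since the remaining control points are determined by the preceding segment.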


In summary, using an autoregressive model provides a straightforward approach
for implementing the meta-time N-Curve model, with the capability
of infinite sequence generation. On the downside, it should be mentioned
that the calculation of the N-Curve segment at meta-time T requires the
calculation of all preceding segments. This is, however, necessary independent
of the specific approach when $C^1$ continuity is required, as it introduces
dependencies between subsequent segments. Although this approach
re-introduces a notion of iterative generation, the constraints imposed by the
underlying parametric curve, which is a spline in this case, remain. Also, with
every stochastic process mode now being modeled by a probabilistic spline,
the need for Monte Carlo simulation is still avoided. Thus, the presented approach
mostly complies with the objectives formulated in Chapter 1. The
only exception is given by multi-step sequence generation beyond $N_{seg}$ steps,
which requires an iterative approach instead of being instantaneous.


4.2.1 Toy Examples 


This section provides a brief evaluation of the capabilities of the meta-time
N-Curve model through different toy examples. Similar to the toy examples
given in Section 4.1, the examples in this section focus on non-conditional
sequence synthesis. Thus, v = 0 is set as the constant input of the recurrent
N-MDN for each meta-time step. This reduces the information processed by the
recurrent N-MDN to the information passed over time through the recurrent
connection. Further, N-Curve segments will be defined by 5 Gaussian control
points. Due to working with discrete-time ground truth stochastic processes
(see also Section 3.2.3), each segment is defined to cover sub-sequences of
length $N_{seg} = 20$.


Three scenarios are considered for comparing the meta-time N-Curve model
to the original N-Curve model. The targeted discrete-time stochastic processes
$\mathcal{G}_T^i$ for each scenario are depicted in Figure 4.14. For training, a set of
M = 200 realizations is sampled from each stochastic process. The curve-time
mapping defined in Section 4.14 is used for all examples.

Figure 4.14: Ground truth discrete-time stochastic processes $\mathcal{G}_T^i$ with
$i \in \{a, b, c\}$ providing different scenarios for exploring the capabilities of the
meta-time N-Curve model. The index sets* are defined as $T = T_{N=58}$ for scenarios (a)
and (b), and $T = T_{N=96}$ for scenario (c), thus covering sequences consisting of 58 and
96 elements, respectively. While one of the processes is subject to varying variance
between time steps, the other two have constant variance. For all figures, Gaussian random
variables along the mean curve of the stochastic process are given with corresponding
$2\sigma$ covariance ellipses.


* The discrete index set notation $T_N$ follows the definition provided in Section 3.2.3.


The first scenario covers the case of a stochastic process with a complex mean
curve, in terms of length and shape. Following this, a parametric curve requires
an increased number of control points to approximate the sequence
properly. For this example, a fixed interval mapping with a = 1 is used
and $C^1$ continuity is enforced. With the training dataset consisting of
sequences of length N = 58 and each segment of the meta-time N-Curve covering
$N_{seg} = 20$ elements, the recurrent N-MDN will generate 3 segments.
Thus, the segmented curve is defined by a total of 13 control points due to
subsequent segments having one control point in common. Following this,
the N-Curve model in comparison is defined with an equal number of 13
control points. The estimated original and meta-time N-Curves generated by
a respective N-MDN and recurrent N-MDN are depicted in Figure 4.15.

Figure 4.15: Approximations of $\mathcal{G}_T^a$ using an N-Curve (a) and a meta-time N-Curve (b) as
generated by an N-MDN and a recurrent N-MDN trained with noisy sequence data.
For the meta-time N-Curve, a fixed interval mapping is used and $C^1$ continuity is
enforced. In (b), N-Curve segments are highlighted by using different colors. 2σ
covariance ellipses are provided for all Gaussian random variables along the mean
curve.
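
The control point count used in this comparison can be reproduced with a few
lines of arithmetic. The following is a minimal sketch, assuming that the
number of segments is the smallest number needed to cover the sequence and
that neighboring segments share exactly one control point; the function names
are hypothetical and not taken from the thesis implementation.

```python
import math


def num_segments(seq_len: int, seg_len: int) -> int:
    """Smallest number of segments of seg_len elements covering seq_len elements."""
    return math.ceil(seq_len / seg_len)


def total_control_points(n_segments: int, points_per_segment: int) -> int:
    """Control points of a segmented curve whose neighboring segments share one point."""
    return n_segments * points_per_segment - (n_segments - 1)


# First scenario: N = 58, N_seg = 20, 5 Gaussian control points per segment.
n_seg = num_segments(58, 20)
print(n_seg, total_control_points(n_seg, 5))
```

With these assumptions, the helper yields 3 segments and 13 control points for
the first scenario, matching the numbers stated in the text.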


While the meta-time N-Curve approximates the mean curve perfectly, the
estimated N-Curve deviates slightly at around x = 2 and x = 4, averaging
out a curved shape. A slight over-estimation of the variance at the
beginning and end of the approximation can be observed for both models.
Although both models perform quite similarly in terms of their generated
approximation, the N-MDN took 8 times more iterations to reach convergence
than the recurrent N-MDN. This observation can most likely be attributed to
single N-Curves of higher degree being harder to fit to given data due to
the global control property.


The second scenario covers a stochastic process with its mean curve including 
sharp edges. In general, such curves cannot be represented by Bézier curves 
due to their smoothness property. Using a segmented curve, on the other
hand, allows such edges by only targeting $C^0$ continuity. Additionally, by
using segments of lower degree, the Gibbs phenomenon [Jer13] can be
circumvented. Apart from the training dataset, the setup for this scenario
is similar to the first scenario. The resulting approximations are depicted
in Figure 4.16.

Figure 4.16: Approximations of $\mathcal{G}_T^b$ using an N-Curve (a) and a meta-time N-Curve (b) as
generated by an N-MDN and a recurrent N-MDN trained with noisy sequence data.
For the meta-time N-Curve, a fixed interval mapping is used. No smoothness
constraints are applied. In (b), N-Curve segments are highlighted by using
different colors. 2σ covariance ellipses are provided for all Gaussian random
variables along the mean curve.


As expected, the estimated N-Curve is unable to replicate the target mean
curve, but still provides a close approximation. As the N-Curve is of higher
degree, the Gibbs phenomenon is quite noticeable in this example. Apart from
minor fluctuations in the variances, the estimated meta-time N-Curve is
accurate with respect to its mean curve.


The third and final scenario regards a stochastic process whose mean curve 
follows a sine wave. Because of the periodicity, a modulo reset mapping with
a = 1 and k = 2 will be used for the meta-time N-Curve. Further, $C^1$
continuity is enforced. Note that k can only be assigned an appropriate value
due to knowledge about the structure of the targeted mean curve. As such,
$\mathcal{G}_T^c$ can be approximated with a meta-time N-Curve, which repeats the same
two segments as many times as required. For comparison, an N-Curve with
9 control points will be estimated. The results are depicted
in Figure 4.17.

Figure 4.17: Approximations of $\mathcal{G}_T^c$ using an N-Curve (a) and a meta-time N-Curve (b) as
generated by an N-MDN and a recurrent N-MDN trained with noisy sequence data.
For the meta-time N-Curve, a modulo reset mapping is used and $C^1$ continuity is
enforced. In (b), N-Curve segments are highlighted by using different colors. 2σ
covariance ellipses are provided for all Gaussian random variables along the mean
curve.


Looking at the estimated N-Curve first, the approximation of $\mathcal{G}_T^c$ is quite
accurate apart from the beginning and ending portions. The meta-time N-Curve,
on the other hand, alternates between the two learned segments in order to
achieve a close approximation of $\mathcal{G}_T^c$. Further, in such scenarios, the curve
could be continued indefinitely, as indicated in Figure 4.17b (purple curve).


4.2.2 Handling underdetermined areas 


As the meta-time N-Curve model is based on segmented curves being calcu- 
lated iteratively using an autoregressive approach, the existence of underde- 
termined areas provides an aspect worth discussing. Such underdetermined 
areas are defined as segments within a meta-time N-Curve, which are not 
well estimated during training. The main causes for this are either areas
being sparsely covered by the training dataset or insufficient model
capacity. Besides the model output within these segments being less stable
and reliable in an application context, such areas also affect subsequent
segments due to error propagation in the autoregressive model structure. As
a general approach for coping with underdetermined areas, a fallback
mechanism can be integrated into the model. As such, the sequence model can
rely on a basic, domain-specific sequence model, which covers regions of
high model uncertainty. Besides the need for a potentially handmade fallback
model, the model uncertainty needs to be measured when using such an
approach. Following this, a brief overview of techniques for measuring model
uncertainty is given. An evaluation of the practicality of the presented
techniques for the meta-time N-Curve is left out, being beyond the scope of
this thesis.


A prominent approach to measuring model uncertainty is given by transform- 
ing a network into a Bayesian neural network (see also Section 2.2.1), which is 
implemented using Monte Carlo Dropout [Gal16, Gal17, Ken17]. In these vari- 
ants of Bayesian neural networks, dropout [Sri14] is applied in conjunction 
with multiple passes through the network in order to generate a distribution 
over the network’s output. This distribution can then be used to measure the 
model’s uncertainty by correlating it to the variance in the generated distri- 
bution. A downside of employing such an approach is that a given sequence 
model needs to be transformed into a Bayesian neural network, thereby also 
inheriting their potentially unwanted properties and problems. An
alternative to Bayesian neural networks is given by Prior Networks [Mal18].
While Bayesian neural networks implicitly model distributional uncertainty,
Prior Networks provide an explicit model for model uncertainty. This is
achieved by parameterizing a prior distribution over predictive
distributions. Thus, the Prior Network approach also requires changes to a
given model. In contrast, an ensemble approach [Lak17, Hua17] can be
pursued, avoiding the need to change the model at hand. Here, an ensemble of
the same model is
trained. As the training process itself is usually stochastic, the resulting en- 
semble consists of several models generating slightly different outputs for the 
same input. As such, using the entire ensemble, a distribution similar to that 
of a Monte Carlo Dropout Bayesian neural network can be generated. 
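
The ensemble idea combined with the fallback mechanism described above can be
sketched as follows. This is only an illustration under stated assumptions:
the ensemble outputs are taken to be point predictions stacked in one array,
the spread measure (summed per-dimension variance) and the threshold are
hypothetical choices, and none of the names stem from the thesis.

```python
from typing import Tuple

import numpy as np


def ensemble_uncertainty(predictions: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Summarize the outputs of K ensemble members.

    predictions: array of shape (K, T, D) holding, for each of the K members,
    a predicted sequence of T steps with D dimensions.
    Returns the ensemble mean (T, D) and a scalar spread per step (T,),
    computed as the variance across members summed over the D dimensions.
    """
    mean = predictions.mean(axis=0)
    spread = predictions.var(axis=0).sum(axis=-1)
    return mean, spread


def use_fallback(spread: np.ndarray, threshold: float) -> np.ndarray:
    """Mark steps whose ensemble spread exceeds a (hypothetical) threshold."""
    return spread > threshold


# Three members agree on the first step and disagree on the second one.
preds = np.zeros((3, 2, 2))
preds[:, 1, 0] = [-1.0, 0.0, 1.0]
mean, spread = ensemble_uncertainty(preds)
print(spread, use_fallback(spread, threshold=0.5))
```

Steps flagged in this way would be the ones delegated to the domain-specific
fallback model; how the threshold is chosen is left open here.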


4.3 Summary 


Overall, this chapter first provided a detailed introduction to Mixture
Density Networks, focusing on their general structure and how their output is
generated. Following this, N-Curve Mixture Density Networks were defined
as a regression-based implementation for the N-Curve model, which enables 
multi-step inference only requiring a single forward pass through the net- 
work. Using synthetically generated data, several toy examples show the 
model’s capability of learning stochastic control points from noisy sequence 
data and explore the model’s behavior and capabilities under different circum- 
stances. Finally, a comparison with an SMC-based approach was performed, 
depicting the advantages of N-Curve Mixture Density Networks during in- 
ference in terms of memory usage and inference time. 


Additionally, a proof of concept for an implementation of the meta-time N- 
Curve model was presented. The presented model relies on an autoregressive 
structure in order to enable the representation of infinite stochastic processes. 
In comparison to N-Curve Mixture Density Networks, toy examples on syn- 
thetically generated data indicate greater flexibility in terms of modeling capa- 
bilities at the cost of requiring a more complex neural network model, which 
is more expensive in terms of computation time during inference. 


This evaluation focuses on the N-Curve model and the N-MDN implementation
as proposed in Sections 3.1 and 4.1. For this, the N-Curve model is applied
to two sequence prediction tasks. In both tasks, evaluated models need
to represent a stochastic process describing $N_{\mathrm{obs}} + N_{\mathrm{pred}}$ time steps, such that
given $N_{\mathrm{obs}}$ observations of a process realization, the remaining $N_{\mathrm{pred}}$ steps
can be inferred.


In the first part of the evaluation, the N-Curve model is applied to the task of 
human trajectory prediction (Section 5.1). On the one hand, this task provides 
easy to interpret and visualize results. As such, it gives a good foundational 
evaluation of the general viability of the model. Further, although being sim- 
ple in terms of data dimensionality, the task provides a lot of complexity with 
human trajectory prediction being a highly multi-modal problem. In this re- 
gard, human trajectory prediction provides an appropriate task for evaluating 
the capabilities of the model. 


Following the evaluation of the viability and capabilities of the N-Curve 
model, its claimed capability of being scalable to arbitrary dimensions is 
assessed. For this, the model is applied to the task of human motion predic- 
tion. This task provides a high-dimensional example, being concerned with 
modeling sequences of human poses (Section 5.2). 


It is worth mentioning that the meta-time N-Curve model (Section 3.2), and
thus the recurrent N-MDN (Section 4.2), are excluded from this evaluation.
There are two main reasons for this. First, the toy examples in Section 4.2
indicate that the meta-time N-Curve model gains an edge over the original
N-Curve model for very long sequences only. However, common evaluations
conducted on real-world data are usually restricted to rather short time
horizons, i.e. short sequences. Second, the meta-time N-Curve model is first
and foremost a conceptual extension to the N-Curve model, lifting some more
domain-specific limitations.


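For convenience, the following notation, extending the notation used in
previous chapters, is introduced for the scope of the evaluation. A dataset is
denoted as $\mathcal{D} = \{X_1, \dots, X_M\}$ and consists of M sequences of fixed length
$N = N_{\mathrm{obs}} + N_{\mathrm{pred}}$ with $X_i = \{x_1^i, \dots, x_N^i\}$. Each dataset can further be
divided into a training and test dataset, such that
$\mathcal{D} = \mathcal{D}_{\mathrm{train}} \cup \mathcal{D}_{\mathrm{test}}$ with $\mathcal{D}_{\mathrm{train}} \cap \mathcal{D}_{\mathrm{test}} = \emptyset$.
Finally, each sequence $X_i$ in a dataset $\mathcal{D}$ is divided into an observed portion
$X_i^{\mathrm{obs}}$ ($N_{\mathrm{obs}}$ time steps) and a target portion $Y_i$ ($N_{\mathrm{pred}}$ time steps)

\[
\begin{aligned}
X_i &= X_i^{\mathrm{obs}} \cup Y_i \\
    &= \{ \underbrace{x_1^i, \dots, x_{N_{\mathrm{obs}}}^i}_{\text{observation}},
          \underbrace{x_{N_{\mathrm{obs}}+1}^i, \dots, x_{N_{\mathrm{obs}}+N_{\mathrm{pred}}}^i}_{\text{target}} \} \\
    &= \{ \underbrace{x_1^i, \dots, x_{N_{\mathrm{obs}}}^i}_{\text{observation}},
          \underbrace{y_1^i, \dots, y_{N_{\mathrm{pred}}}^i}_{\text{target}} \},
\end{aligned}
\]

where the target portion is to be predicted.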
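
In array terms, the split into observation and target is a simple slice along
the time axis. The sketch below assumes a sequence stored as a NumPy array of
shape (N, d); the concrete values of $N_{\mathrm{obs}}$ and $N_{\mathrm{pred}}$ in the example are
placeholders chosen for illustration, not values taken from this section.

```python
import numpy as np


def split_sequence(x: np.ndarray, n_obs: int):
    """Split a sequence of shape (N, d) into observation and target.

    The first n_obs steps form the observed portion, the remaining
    N - n_obs steps form the target portion that is to be predicted.
    """
    if not 0 < n_obs < len(x):
        raise ValueError("n_obs must lie strictly between 0 and the sequence length")
    return x[:n_obs], x[n_obs:]


# Placeholder example: a 2D sequence with N = 20, split with N_obs = 8.
sequence = np.arange(40, dtype=float).reshape(20, 2)
observation, target = split_sequence(sequence, n_obs=8)
print(observation.shape, target.shape)
```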


5.1 Long-term Human Trajectory Prediction 


With the emergence of autonomous driving and advances in the field of au- 
tomated video surveillance, the task of human trajectory prediction gained 
a significant amount of research interest in recent years. A trajectory is de- 
fined as a sequence of locations in a regarded scene, with some velocity pro- 
file attached to it. Predictions are then performed on sequences consisting 
of subsequent 2D image coordinates or 3D world coordinates, generated by 
e.g. a detection-tracking pipeline. Generally speaking, human trajectory
prediction can be subdivided into a number of more specific tasks, depending
on the time horizon for prediction, the point of view of recording and camera
motion. Each of these aspects impacts observations, and thus the respective
prediction models, in a different way.


Time Horizon: In autonomous systems, the prediction task can be divided into 
short-term (0.5 up to 2 seconds) and long-term (up to 20 seconds) trajectory 
prediction [Rud20b]. While short-term predictions are mainly used for im- 
mediate decisions, such as collision avoidance, long-term prediction impacts 
the long-term behavior of an autonomous system, e.g. by influencing its path 
planning component. Considering short-term prediction, linear models com- 
bined with local collision avoidance approaches, e.g. the social force model 
[Hel95] or Optimal Reciprocal Collision Avoidance (abbrev.: ORCA, [Van11]), 
are generally well suited. In the context of human trajectories, ORCA yields 
more realistic motion patterns [Kot21]. In long-term trajectory prediction, 
the trajectory shape is greatly influenced by the surrounding static environ- 
ment and interactions with other pedestrians. The extent of influence is highly 
dependent on the ground resolution and annotation rate of a given dataset 
[Hug21], as well as the pedestrian density. 


Point of view: Most commonly, trajectory prediction datasets are recorded 
from a bird’s eye view (top view), an elevated viewpoint with a tilted camera 
(tilted view) or from a camera positioned on the ground, e.g. mounted to a 
car (frontal view). While top view and surveillance datasets yield complete 
trajectories, occlusions occur frequently in frontal view datasets. As a conse- 
quence, prediction models need to be able to cope with missing inputs when 
working with frontal view datasets. In addition, constant velocity trajectories 
are distorted in frontal view datasets due to perspective distortion. 


Camera motion: For top view and surveillance datasets, static cameras are a 
common choice. As such, recorded trajectory data complies with the static ob- 
served scene, potentially resulting in decision points, e.g. junctions, at specific 
locations. With frontal view datasets, identical trajectories change in shape 
with the ego-motion of the camera, when mounted to a car. In such cases, 
datasets are usually transformed into an ego-motion compensated reference 
frame (e.g. [Sch13]). 


Accordingly, the following evaluation focuses on long-term trajectory
prediction using bird's eye view data, giving a suitable task for
learning-based sequence prediction models. Given an observed trajectory
$X^{\mathrm{obs}} = \{x_1, \dots, x_{N_{\mathrm{obs}}}\}$ consisting of $N_{\mathrm{obs}}$ observed positions of a person,
the subsequent $N_{\mathrm{pred}}$ future positions need to be predicted.


5.1.1 Dataset Overview 


With the rising interest in the topic of human trajectory prediction, a number
of datasets have emerged. These datasets are most often created from annotated
videos, recorded from a specific point of view. An overview of commonly 
used human trajectory datasets, categorized by the respective point of view, 
is given in Table 5.1. 

Table 5.1: Non-exhaustive list of datasets appropriate for human trajectory prediction. Note
that the list of top view datasets also includes datasets providing real-world positions
instead of image coordinates. As data of persons moving on a flat plane is recorded,
these datasets are similar to top view datasets, except trajectory points are given in a
3D reference frame with constant elevation.


Point of View    Datasets
Top view         BIWI Walking Pedestrians [Pel09]
                 Crowds by Example [Ler07]
                 Stanford Drone Dataset [Rob16]
                 Edinburgh Forum [Maj09]
                 inD Dataset [Boc19]
                 Thor Dataset [Rud20a]
                 CITR [Yan19]
                 DUT [Yan19]
Tilted view      Grand Central [Yi15]
                 PETS 2009 [Fer09]
                 VIRAT [Oh11]
                 Town Center [Ben11]
                 WILDTRACK [Cha18]
Frontal view     KITTI [Gei13]
                 nuScenes [Cae20]
                 JAAD [Ras17b, Ras17a]
                 Daimler [Sch13]


In the context of long-term human trajectory prediction, top view and sur- 
veillance datasets are preferred due to the lack of occlusions and perspective 
distortions. In addition, data recorded from a static scene imposes a structure 
onto the dataset, which yields well-defined walking paths and decision points, 
exposing the multi-modal nature of human trajectory prediction. 


Finally, the most commonly used datasets for long-term human trajectory pre- 
diction include the BIWI Walking Pedestrians (abbrev.: biwi), Crowds by Ex- 
ample (abbrev.: crowds) and the Stanford Drone (abbrev.: sdd) datasets. These 
datasets further consist of 2, 4 and 8 scenes, respectively. In the following, 
these scene datasets will be referred to as dataset:scene, e.g. biwi:eth. In the
case of the Stanford Drone Dataset, multiple, partially overlapping*,
recordings of the same scene are provided. The specific recording will be
indicated by a number added to the scene dataset abbreviation, e.g. sdd:hyang00.


5.1.2 State-of-the-art Human Trajectory Prediction Models


State-of-the-art deep learning-based sequence prediction models for
long-term human trajectory prediction can be divided into aggregating and
holistic models. Holistic models, on the one hand, model the entire observed
scene including all pedestrians by using a spatio-temporal graph network,
where each object in the scene is a unique node (e.g. [Moh20, Sal20]). In
contrast, the more prevalent aggregating models have separate processing
pipelines for each type of input, which are fused together
at some point. For this class of models, a modular meta-architecture revolv- 
ing around an underlying base sequence model can be defined, covering the 
main components of each model. Additional types of inputs, also referred 
to as additional cues, are discussed later in this section. A schematic of this 
meta-architecture is depicted in Figure 5.1. 


* In the sense of the observed real-world scenery. 

Figure 5.1: Schematic of a meta-architecture for aggregating trajectory prediction models. Such
models contain at least some sequence model and optional building blocks for pro- 
cessing additional cues, such as social or environmental context. 


Each aggregating prediction model at least consists of a base sequence model,
which encodes input trajectories (the observation) and generates either
single trajectories or probabilistic predictions. Taking a range of
state-of-the-art deep learning-based prediction models into consideration,
these models can be boiled down to a few base sequence models. An overview of
commonly used base sequence models is given in Table 5.2. Note that due to
the existence of many similar models, only representative examples for each
base sequence model are featured. For a comprehensive overview of existing
human trajectory prediction approaches, the reader is referred to recent
surveys, e.g. [Rud20b]. Further, no distinction is made for variants of the
same base model, as these most commonly only differ slightly. Finally,
endpoint-conditioned prediction models (e.g. [Kit12, Man20]) are excluded
from this overview, as the endpoint is assumed to be unknown in the context
of this evaluation.

Table 5.2: Overview of commonly used base sequence models in human trajectory prediction
alongside representative prediction models. The most frequently used sequence models
are given by Recurrent Mixture Density Networks (abbrev.: R-MDN, [Gra13]), variants
of Generative Adversarial Networks (abbrev.: GAN, [Goo14]) and Variational
Autoencoders (abbrev.: VAE, [Kin14]) combined with a sequence to sequence model
[Sut14], as well as Transformers [Vas17]. Transformers are only recently being studied
in the context of human trajectory prediction. Note that Temporal Convolutional
Networks (abbrev.: TCN, [Oor16, Bai18]) are excluded from this overview,
as these are rarely used and yield similar performance to LSTM networks [Bec18].


Base Model     Model
R-MDN          Social LSTM [Ala16]
               ParticleLSTM [Hug18]
               Social Attention [Vem18]
GAN            Social GAN [Gup18]
               SoPhie [Sad19]
               Social Ways [Ami19]
VAE            DESIRE [Lee17]
               LSTM-BMS [Bha18]
               DAG-Net [Mon21]
Transformer    STAR [Yu20]
               TF [Giu21]
               AgentFormer [Yua21]

[Columns "Additional cues: Social" and "Additional cues: Environmental": per-model marks not recoverable.]


With reference to the introductory section on sequence modeling (see Chapter 
2), each of the base models listed in Table 5.2 provides certain benefits for the 
task of human trajectory prediction. R-MDN and Transformer models on the 
one hand are purely regression-based and thus easier to train. Additionally, 
these models can be used to output an explicit probability distribution over fu- 
ture trajectories by parameterizing a Gaussian mixture model. This, however, 
comes at the cost of a more difficult approach to generate multi-modal pre- 
dictions. When parameterizing a Gaussian mixture model, the model can for 
example be embedded into a particle filter cycle [Hug18]. Another approach 
construes the trajectory prediction problem as a classification task, where
possible future predictions are covered by different classes [Giu21]. In
contrast, VAE and GAN are probabilistic models providing an implicit model of
the data distribution. Both models employ a generator network processing a
stochastic input in addition to the encoded observation. As a consequence,
these models provide a straightforward approach to generating multi-modal
predictions by sampling.


In recent years, an increasing number of approaches emphasize the use of
additional cues. The most common additional cues are given by the social
context, i.e. neighboring pedestrians, and the environmental context, i.e.
static scene elements. Established approaches for incorporating social
context are commonly based on either grid-based pooling (e.g. [Ala16, Gup18]),
graph attention (e.g. [Vem18, Kos19, Hua19]) or graph convolution (e.g. [Moh20]).
Environmental context, on the other hand, is usually given by an encoding 
of some reference image or video frame generated by a Convolutional Neural 
Network (CNN). 


As the N-Curve model introduced in this thesis provides an alternative model 
for the underlying base sequence model, such additional cues will not be 
considered in the following quantitative and qualitative evaluation.
Consequently, the performance of the N-Curve model is compared with the
aforementioned base sequence models. Note that when taking away
the additional cue components, most state-of-the-art models collapse onto
their underlying base sequence models, thus justifying a comparison based
on these base sequence models.


5.1.3 Evaluation Setup 


In order to provide a comprehensive evaluation of the N-Curve model in the 
context of long-term human trajectory prediction and in comparison with 
commonly used sequence models, the tasks of unimodal and multi-modal tra- 
jectory prediction are considered. Therefore, the current standard approach 
to evaluation in the literature is extended by using additional datasets and 
performance measures, as it does not cover the task of multi-modal trajectory 
prediction. Further, the evaluation will be performed on each selected dataset 
in isolation, due to the removal of additional cues for the sequence models. 
Without such additional cues, long-term prediction requires well-defined de- 
cision points tied to static locations in the observed scene, in order to capture 
relevant walking paths. This is especially true for non-goal-driven prediction
approaches, as regarded in this evaluation. As a consequence, pooling to- 
gether unrelated datasets into a common reference frame cannot be justified 
and thus datasets are evaluated in isolation. 


5.1.3.1 Selected Datasets 


Under these conditions and in compliance with the current standard eval- 
uation approach, the biwi:eth, biwi:hotel, crowds:zara01 and crowds:zara02 
datasets are selected. The crowds:students dataset is left out, as it focuses 
heavily on human-human interaction and as such does not provide well- 
defined walking paths or decision points. As these datasets provide rather 
simple scene geometry, additional scenes are taken from the Stanford Drone 
Dataset. In order to keep the evaluation more concise, scene datasets with 
varying complexity [Ami20, Hug21] are considered. Thus, the sdd:bookstore03 
and sdd:hyang00 datasets are included in the evaluation. For these datasets, 
only pedestrian trajectories are considered*. Table 5.3 and Figures 5.2 and 5.3
depict samples from the datasets and relevant statistical details. 


Table 5.3: Statistical details of human trajectory datasets selected for evaluation. Note
that the number of trajectories can deviate from those given in the literature,
as trajectories lying outside the image boundary after projection are dismissed. The
trajectory length denotes the number of points defining a specific trajectory.


Dataset            Image Resolution    # Trajectories    Average Trajectory Length
biwi:eth           640x480             354               15.47 ± 8
biwi:hotel         720x576             378               17.22 ± 12.16
crowds:zara01      720x576             128               34.27 ± 17.11
crowds:zara02      720x576             204               47.09 ± 72.30
sdd:bookstore03    1322x1079           260               42.37 ± 40.14
sdd:hyang00        1455x1925           285               60.39 ± 44.99


* The Stanford Drone dataset provides trajectory data for a multitude of different agent types, 
including for example pedestrians, bikers and skateboarders. 

Figure 5.2: Overview of human trajectory datasets selected for evaluation. Sub-figures depict a
reference image of the recorded scenery (left) and the overlaid dataset (right). Note:
For illustration purposes, the image and data scale is aligned for all datasets; for the
actual image resolutions, see Table 5.3.


(c) sdd:hyang00


Figure 5.3: Overview of human trajectory datasets selected for evaluation. Sub-figures depict a
reference image of the recorded scenery (left) and the overlaid dataset (right). Note:
For illustration purposes, the image and data scale is aligned for all datasets; for the
actual image resolutions, see Table 5.3.




Preprocessing: At first, all datasets originally provided in world space
coordinates are projected into image space using homographies. In the
literature, the annotation frequency of the datasets is usually set to 2.5
annotations per second, which equals the annotation rate of the BIWI Walking
Pedestrians dataset. Thus, the annotation frequency of all datasets included
in the evaluation is adjusted accordingly. Further, the evaluation is
conducted on trajectories of a fixed length $N = N_{\mathrm{obs}} + N_{\mathrm{pred}}$ (see also Section
5.1.3.5). Following this, all (sub-)trajectories of a given length are
extracted from each respective dataset in order to provide training and test
datasets. Trajectories shorter than the given length are not considered for
evaluation. As a final data preprocessing step, trajectories of non-moving or
slow-moving persons are filtered out, as statistical models are worse at
modeling trajectories of slow-moving persons, because their behavior becomes
less predictable [Has19]. Thus, the dataset-dependent required minimum speed*
is calculated heuristically for a given dataset $\mathcal{D}$ containing all M possible
(sub-)trajectories of length N:

\[
s_{\min} = \frac{\max_i m_{\mathrm{speed}}(i) - \min_i m_{\mathrm{speed}}(i)}{M}
\quad \text{with} \quad
m_{\mathrm{speed}}(i) = \frac{1}{N-1} \sum_{t=2}^{N} \| x_t^i - x_{t-1}^i \|_2 .
\tag{5.1}
\]

Here, $m_{\mathrm{speed}}(i)$ denotes the average speed within the i'th trajectory $X_i \in \mathcal{D}$.


* The average Euclidean distance between subsequent trajectory points.
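
The sub-trajectory extraction and the speed-based filtering can be sketched as
follows. This is a minimal illustration, assuming tracks stored as NumPy
arrays of shape (T, 2) that are already resampled to the common annotation
rate; the function names are hypothetical, the threshold follows Equation 5.1
as reconstructed from the text, and keeping trajectories with a speed equal to
the threshold is an assumption.

```python
import numpy as np


def extract_subtrajectories(track: np.ndarray, length: int) -> list:
    """All contiguous windows of `length` points; empty if the track is too short."""
    return [track[s:s + length] for s in range(len(track) - length + 1)]


def mean_speed(traj: np.ndarray) -> float:
    """Average Euclidean distance between subsequent points of one trajectory."""
    return float(np.linalg.norm(np.diff(traj, axis=0), axis=1).mean())


def min_speed_threshold(trajs: list) -> float:
    """Heuristic minimum speed: (max mean speed - min mean speed) / M."""
    speeds = np.array([mean_speed(t) for t in trajs])
    return float((speeds.max() - speeds.min()) / len(trajs))


def filter_slow(trajs: list) -> list:
    """Drop trajectories whose mean speed lies below the heuristic threshold."""
    s_min = min_speed_threshold(trajs)
    return [t for t in trajs if mean_speed(t) >= s_min]


# A standing, a slow and a fast person, each observed for 3 steps.
standing = np.zeros((3, 2))
slow = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
fast = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
kept = filter_slow([standing, slow, fast])
print(len(kept), len(extract_subtrajectories(np.zeros((5, 2)), 3)))
```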


5.1.3.2 Performance Measures


In the standard evaluation approach, the designated performance measures
are given by the Average Displacement Error (abbrev.: ADE) and the Final
Displacement Error (abbrev.: FDE), defined as

\[
\mathrm{ADE} = \frac{1}{M \cdot N_{\mathrm{pred}}} \sum_{i=1}^{M} \sum_{t=1}^{N_{\mathrm{pred}}} \| \hat{y}_t^i - y_t^i \|_2
\tag{5.2}
\]

and

\[
\mathrm{FDE} = \frac{1}{M} \sum_{i=1}^{M} \| \hat{y}_{N_{\mathrm{pred}}}^i - y_{N_{\mathrm{pred}}}^i \|_2
\tag{5.3}
\]

for a given prediction horizon $N_{\mathrm{pred}}$, a set $Y = \{Y_1, \dots, Y_M\}$ of M ground
truth trajectories $Y_i = \{y_1^i, \dots, y_{N_{\mathrm{pred}}}^i\}$ and corresponding predictions
$\hat{Y}_i = \{\hat{y}_1^i, \dots, \hat{y}_{N_{\mathrm{pred}}}^i\}$ generated by a given prediction model. The ADE is then
defined by the average L2 distance between the ground truth and a
corresponding predicted trajectory, while the FDE is defined by the L2
distance between the final ground truth and predicted trajectory points after
the prediction horizon. In the case of probabilistic sequence models, which
generate a predictive distribution $p(y_1, \dots, y_{N_{\mathrm{pred}}} \mid x_1, \dots, x_{N_{\mathrm{obs}}})$, $\hat{Y}_i$
corresponds to a maximum likelihood prediction given the probabilistic output
of the model.
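
Both displacement errors reduce to a few array operations. The following
sketch assumes predictions and ground truth stored as NumPy arrays of shape
(M, N_pred, 2); the function names are chosen for illustration only.

```python
import numpy as np


def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average L2 distance over all M trajectories and all predicted steps."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def fde(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average L2 distance between the final predicted and ground truth points."""
    return float(np.linalg.norm(pred[:, -1] - gt[:, -1], axis=-1).mean())


# Two ground truth trajectories with three steps; predictions are exact
# except for the final step, which is off by the vector (3, 4).
gt = np.zeros((2, 3, 2))
pred = gt.copy()
pred[:, -1] += np.array([3.0, 4.0])
print(ade(pred, gt), fde(pred, gt))
```

In this toy case, only the last of three steps is off by a distance of 5, so
the FDE is three times the ADE, which illustrates how the FDE isolates the
error at the end of the prediction horizon.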


As the ADE and FDE do not provide an adequate measure for assessing the
quality of (multi-modal) probabilistic predictions, another performance
measure is required for this case. Due to the actual ground truth probability
distribution for each time step being unknown, a common choice is given by the
Negative (data) Log-Likelihood (abbrev.: NLL, e.g. [Bha18, Iva19]), defined as

\[
\mathrm{NLL} = \frac{1}{M \cdot N_{\mathrm{pred}}} \sum_{i=1}^{M} \sum_{t=1}^{N_{\mathrm{pred}}} -\log p(y_t^i \mid \cdot).
\tag{5.4}
\]

Here, $p(y_t^i \mid \cdot)$ denotes the predictive distribution for the t'th trajectory
position as generated by the probabilistic sequence model. Note that the
conditional part of this distribution is not given explicitly, as it varies
between different models (see Section 5.1.3.5). It is worth mentioning that
sometimes an oracle measure (e.g. [Lee17]) is used as a sample-based
substitute for the NLL. This measure does, however, introduce another
hyperparameter, which is why the NLL is preferred in the context of this
evaluation.
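
As an illustration of Equation 5.4, the sketch below evaluates the NLL for
per-step predictive distributions given as Gaussian mixtures. Several aspects
are assumptions made for the example only: each mixture uses diagonal
covariances, the mixtures are passed as nested lists of (weights, means,
standard deviations), and all names are hypothetical.

```python
import numpy as np


def gmm_log_pdf(y, weights, means, stds) -> float:
    """Log density of a point y (d,) under a K-component diagonal Gaussian mixture.

    weights: (K,), means: (K, d), stds: (K, d) per-dimension standard deviations.
    """
    d = means.shape[1]
    z = (y - means) / stds
    log_comp = (-0.5 * np.sum(z ** 2, axis=1)
                - np.sum(np.log(stds), axis=1)
                - 0.5 * d * np.log(2.0 * np.pi))
    a = np.log(weights) + log_comp
    m = a.max()  # log-sum-exp for numerical stability
    return float(m + np.log(np.exp(a - m).sum()))


def nll(targets: np.ndarray, mixtures: list) -> float:
    """Average negative log-likelihood over M trajectories and N_pred steps.

    targets: (M, N_pred, d); mixtures[i][t] = (weights, means, stds) is the
    predictive distribution for step t of trajectory i.
    """
    n_traj, n_pred, _ = targets.shape
    total = 0.0
    for i in range(n_traj):
        for t in range(n_pred):
            total -= gmm_log_pdf(targets[i, t], *mixtures[i][t])
    return total / (n_traj * n_pred)


# One trajectory, one step, bimodal prediction with modes at (-1, 0) and (1, 0).
mix = (np.array([0.5, 0.5]),
       np.array([[-1.0, 0.0], [1.0, 0.0]]),
       np.full((2, 2), 0.5))
print(nll(np.array([[[1.0, 0.0]]]), [[mix]]))
```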


5.1.3.3 Baselines 


In order to provide reference values for comparison, a simple baseline is given
for each performance measure. In the case of the ADE and FDE, a simple
prediction model is given by a linear extrapolation calculated from a
respective observed trajectory. Here, the relative offset
$\delta_i = x_{N_{\mathrm{obs}}}^i - x_{N_{\mathrm{obs}}-1}^i$ of the two most recent observations is projected
$N_{\mathrm{pred}}$ steps into the future, as these positions are assumed to have the most
impact on the future trajectory [Sch20a]. In the case of the NLL measure, a
sample-based prediction can be generated for each future position by using a
shotgun approach [Paj18]. In this approach, multiple future trajectories are
generated by randomly altering the direction and scale of the relative offset
$\delta_i$ before projection. The altered offset for each future trajectory is
then given by $R_\alpha \cdot \delta_i \cdot s$ with $\alpha \sim \mathcal{N}(0, \sigma_\alpha)$, $s \sim \mathcal{N}(1, \sigma_s)$
and the matrix $R_\alpha$ describing a rotation by $\alpha$ degrees. This yields a unimodal
probabilistic prediction with a fixed variance for each predicted time step. In
the following, $\sigma_\alpha = 15°$ and $\sigma_s = 0.1$ are used. An exemplary prediction
using both approaches is depicted in Figure 5.4.
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(a) Linear extrapolation (b) Shotgun 


Figure 5.4: Exemplary predictions generated by a linear extrapolation model and a shotgun 
model. Predicted samples generated by the shotgun model around the mean lin- 
ear prediction (green) are depicted in blue. 
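
The two baselines can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the evaluation: the function names, the number of generated samples and the interpretation of $\sigma_\alpha$ and $\sigma_s$ as standard deviations are assumptions made for illustration only.

```python
import math
import random


def linear_extrapolation(observed, n_pred=12):
    """Project the offset of the two most recent observations n_pred steps ahead."""
    (x0, y0), (x1, y1) = observed[-2], observed[-1]
    dx, dy = x1 - x0, y1 - y0  # relative offset delta
    return [(x1 + k * dx, y1 + k * dy) for k in range(1, n_pred + 1)]


def shotgun(observed, n_pred=12, n_samples=100, sigma_alpha_deg=15.0,
            sigma_s=0.1, rng=None):
    """Generate sample trajectories by randomly rotating and scaling the offset.

    n_samples is a hypothetical parameter; sigma_alpha_deg and sigma_s are
    treated as standard deviations (assumption).
    """
    rng = rng or random.Random()
    (x0, y0), (x1, y1) = observed[-2], observed[-1]
    dx, dy = x1 - x0, y1 - y0
    samples = []
    for _ in range(n_samples):
        alpha = math.radians(rng.gauss(0.0, sigma_alpha_deg))  # rotation angle
        s = rng.gauss(1.0, sigma_s)                            # scale factor
        # Rotate the offset by alpha, then scale it by s.
        rx = s * (math.cos(alpha) * dx - math.sin(alpha) * dy)
        ry = s * (math.sin(alpha) * dx + math.cos(alpha) * dy)
        samples.append([(x1 + k * rx, y1 + k * ry) for k in range(1, n_pred + 1)])
    return samples
```

With both standard deviations set to zero, every shotgun sample coincides with the linear extrapolation, which makes the relation between the two baselines explicit.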


In addition to these two baselines, a simple LSTM baseline is provided, for
two main reasons. On the one hand, the LSTM model is an integral component
of multiple sequence models included in the evaluation. On the other
hand, it is a widely used baseline next to the linear extrapolation approach.


5.1.3.4 Implementation Details 


This section gives a brief overview of implementation details for the
sequence models in comparison. The implementations are based on existing
approaches for which an implementation is publicly available. These
implementations are adapted to use a common data pipeline. If necessary,
components for processing additional cues, such as social context, are
removed. The approaches the implementations are based on, alongside the
adaptations made, are listed in Table 5.4.

Table 5.4: Sequence model implementations adapted for use in this evaluation.

Model        Based on                         Adaptations
R-MDN        ParticleLSTM [Hug18],            -
             own implementation
GAN          Social GAN [Gup18],              Removed social context,
             original implementation (1)      data pipeline
VAE          LSTM-BMS [Bha18],                PyTorch re-implementation,
             original implementation (2)      data pipeline
Transformer  TF [Giu21],                      Data pipeline
             original implementation (3)

(1) https://github.com/agrimgupta92/sgan
(2) https://github.com/apratimbhattacharyya18/CGM_BestOfMany
(3) https://github.com/FGiuliari/Trajectory-Transformer


The remainder of this section provides implementation details for the
prediction models in comparison, including the N-MDN. For each
model, the respective loss function, training details and the type of output
generated by the model are described. Additionally, a simplified structure
illustration is given for each model. These illustrations also serve the purpose of
highlighting relevant hyperparameters of each model. The values chosen for
each hyperparameter and relevant general information are given at the end of
this section.


N-MDN: For the task of human trajectory prediction, the N-MDN is used in
a conditional setting. Thus, the MDN's input vector v (see Section 4.1) needs
to hold information about the observed trajectory, in order to condition the
MDN's output upon the observation. In accordance with a wide range of
human trajectory prediction models, an LSTM network is used for encoding
the observed trajectory. The conditional N-MDN is illustrated in Figure 5.5.

Figure 5.5: Simplified illustration of the conditional N-MDN. Relevant hyperparameters
depicted in blue are given by the hidden state dimension $d_{\mathrm{enc}}$ of the LSTM encoder,
as well as the number of N-Curves $N_{\mathrm{curves}}$ output by the MDN. Each generated
N-Curve is defined by $N_{\mathrm{cpts}}$ Gaussian control points.


As N-Curves model entire trajectories, an N-Curve can be used either to
model only the future trajectory, or to model the observed trajectory together with
the future trajectory. Both options will be considered in the evaluation.


R-MDN: The SMC-based R-MDN variant used in this evaluation belongs to the 
group of 1-to-1 sequence models (see Section 2.1.1), processing one trajectory 
point at a time. As such, the model takes a discrete trajectory point as input 
and outputs the parameters of a Gaussian mixture distribution modeling the 
next trajectory point. In order to enable the model to generate a multi-modal 
prediction, multiple points are sampled from the output distribution and fed 
back into the model. To prevent exponential growth of samples, subsequent 
output distributions are combined and re-sampled [Hug18]. A schematic of
this model is given in Figure 5.6.


[Schematic: Observation -> LSTM ($d_{\mathrm{lstm}}$) -> MDN ($N_{\mathrm{comps}}$) -> Prediction samples]


Figure 5.6: Simplified illustration of the SMC-based R-MDN. Relevant hyperparameters depicted
in blue are given by the hidden state dimension $d_{\mathrm{lstm}}$ of the LSTM network and the
number of mixture components $N_{\mathrm{comps}}$ generated by the MDN.


During training, the commonly used teacher forcing approach (see Section
2.1.1) is used, as the model generates its prediction sequentially. With the
model generating a sequence of conditional mixture distributions, the
optimization is based on the negative log-likelihood loss

$$\mathcal{L} = \sum_{t=2}^{N_{\mathrm{obs}}+N_{\mathrm{pred}}} -\log p(x_t \mid x_{t-1}, \dots, x_1), \tag{5.5}$$

for a given training sample trajectory $X = \{x_1, \dots, x_{N_{\mathrm{obs}}+N_{\mathrm{pred}}}\}$.
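
As an illustration of this loss, the following sketch evaluates the summed negative log-likelihood for 2-dimensional Gaussian mixtures. The diagonal covariance structure and the names `mixture_log_pdf` and `sequence_nll` are assumptions for the sake of a compact example; the covariance parameterization of the actual R-MDN is not specified here.

```python
import math


def mixture_log_pdf(point, weights, means, sigmas):
    """Log-density of a 2-D Gaussian mixture with diagonal covariances (assumption)."""
    logs = []
    for w, (mx, my), (sx, sy) in zip(weights, means, sigmas):
        z = ((point[0] - mx) / sx) ** 2 + ((point[1] - my) / sy) ** 2
        logs.append(math.log(w) - math.log(2.0 * math.pi * sx * sy) - 0.5 * z)
    m = max(logs)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in logs))


def sequence_nll(trajectory, step_params):
    """Sum of -log p(x_t | x_{t-1}, ..., x_1) for t = 2, ..., len(trajectory).

    step_params[t - 2] holds (weights, means, sigmas) of the mixture predicted
    for x_t; under teacher forcing it is computed from the ground truth prefix.
    """
    return -sum(mixture_log_pdf(x, *params)
                for x, params in zip(trajectory[1:], step_params))
```

The first trajectory point has no predecessor and therefore does not contribute to the sum, matching the lower summation bound $t = 2$.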


GAN: For applying a GAN in the context of human trajectory prediction, a
sequence processing unit must be incorporated into the model. Following
[Gup18], a sequence-to-sequence LSTM (see Sections 2.1.1 and 2.2.3) is built
into the generator network and another LSTM encoder is built into the
discriminator network. The GAN encodes the observed trajectory and then adds
a random noise vector to the encoded representation in order to sequentially
generate a prediction. By performing multiple passes through the decoder
network using different noise vectors, a sample-based distribution of future
trajectories is generated. A simplified illustration of the GAN is depicted in
Figure 5.7.

Figure 5.7: Simplified illustration of the GAN. Relevant hyperparameters depicted in blue are
given by the dimensionality of the noise vector $d_{\mathrm{noise}}$, the hidden dimension of the
generator's LSTM encoder $d_{\mathrm{enc}}$ and decoder $d_{\mathrm{dec}}$, as well as the discriminator's LSTM
encoder $d_{\mathrm{discr\_enc}}$ and feed forward network $d_{\mathrm{ff}}$. The discriminator part (dashed
boxes) is only used during training. The noise vector $z$ is sampled from $\mathcal{N}(0, I)$.


In contrast to the R-MDN, an auto-conditioning approach (see Section 2.1.1) is
employed during training. For the loss calculation, $K$ samples $\{\hat{Y}_1, \dots, \hat{Y}_K\}$
with $\hat{Y}_i = \{\hat{y}_1^i, \dots, \hat{y}_{N_{\mathrm{pred}}}^i\}$ are generated. The loss function consists of a
variety loss

$$\mathcal{L}_{\mathrm{variety}} = \min_i \sum_{t=1}^{N_{\mathrm{pred}}} \lVert y_t - \hat{y}_t^i \rVert_2 \tag{5.6}$$

combined with the GAN adversarial loss

$$\mathcal{L} = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)} [\log D(y)] + \mathbb{E}_{z \sim p(z)} [\log(1 - D(G(z)))]. \tag{5.7}$$


D and G denote the discriminator and generator networks, respectively. The 
variety loss is intended to encourage the GAN to generate diverse future tra- 
jectory predictions for the same observed trajectory. 
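
The variety loss can be sketched in a few lines of plain Python. This is an illustrative stand-alone version with a hypothetical function name, not the code of the adapted Social GAN implementation, which operates on batched tensors.

```python
import math


def variety_loss(ground_truth, samples):
    """Smallest summed L2 distance between the ground truth and any of the K samples."""
    def summed_l2(prediction):
        # Sum of point-wise Euclidean distances over all predicted time steps.
        return sum(math.dist(p, q) for p, q in zip(ground_truth, prediction))

    return min(summed_l2(prediction) for prediction in samples)
```

Because only the best of the K samples contributes to the loss, the remaining samples are free to cover other plausible futures without being penalized.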


VAE: Similar to the GAN extension, a sequence-to-sequence LSTM is built into
the VAE in order to enable sequence processing. Further, prediction
generation works similarly to the GAN model by adding a random vector to an encoded
representation of an observed trajectory in order to generate multiple future
trajectories. A schematic of the VAE is depicted in Figure 5.8.

Figure 5.8: Simplified illustration of the VAE. Relevant hyperparameters depicted in blue are
given by the dimensionality of the latent space $d_{\mathrm{latent}}$ and of the hidden state of the
LSTM encoder $d_{\mathrm{enc}}$ and decoder $d_{\mathrm{dec}}$. In addition, the hidden state dimension $d_{\mathrm{lstm}}$
of the auxiliary LSTM encoder, which is only active during training, gives another
hyperparameter. The random vector $z$ is sampled from $\mathcal{N}(0, I)$ during inference, while the
parameters of the Gaussian distribution are determined by the auxiliary LSTM
encoder during training.


During training, the LSTM decoder takes the encoded observation with the
added random vector as input for every prediction step. In this way, neither
teacher forcing nor auto-conditioning schemes are necessary. At the same
time, the model relies entirely on the progressing internal LSTM state to
generate an appropriate prediction. Similar to the GAN, K samples are generated
for the loss calculation. Here, the loss function consists of a variation of the
VAE's standard ELBO (evidence lower bound) loss


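$$\mathcal{L} = \max_i \left\{ \log p(Y \mid z_i, X_{\mathrm{obs}}) \right\} - \log T - D_{\mathrm{KL}}\left( q(z \mid X) \,\|\, p(z \mid X_{\mathrm{obs}}) \right) \tag{5.8}$$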


with $z_i \sim q(z \mid X)$. This best of many samples variant of the ELBO contains a
variety loss component, comparable to that of the GAN implementation.


Transformer: Although the implementation chosen for this evaluation does
not provide a probabilistic prediction model, it is considered in this
comparison, as it is a strong contender to the established LSTM networks built
into many human trajectory prediction models. It is an attention-based
sequence model, consisting of an encoder, which encodes the entire observed
trajectory into a single vector, and a decoder, which sequentially generates
one trajectory point at a time, given the encoding. A schematic of this model
is depicted in Figure 5.9.


[Schematic: encoder (embedding, self-attention, feed forward network) mapping the observation to $v_{\mathrm{enc}}$; decoder with encoder-decoder attention generating the prediction]


Figure 5.9: Simplified illustration of the Transformer. Relevant hyperparameters depicted in
blue are given by the model dimension $d_{\mathrm{model}}$, the number of attention heads $N_{\mathrm{heads}}$
and the dimension of the feed forward network $d_{\mathrm{ff}}$. The encoder (top) and decoder
(bottom) networks share the same hyperparameters. In the decoder network, (0)
denotes a start of sequence token used as input for the initial prediction step.


Similar to the R-MDN, a teacher forcing approach is used during training. As
the model generates a single future trajectory, the optimization can be based
on the $L_2$ loss function

$$\mathcal{L} = \sum_{t=1}^{N_{\mathrm{pred}}} \lVert \hat{y}_t - y_t \rVert_2. \tag{5.9}$$


Hyperparameters: The models' hyperparameters are determined by running a
grid search around the parameters provided by the respective authors and
selecting those that perform most consistently across all datasets. For
hyperparameters yielding similar model results, the parameterization given by the
respective authors is favored. Output-related parameters for the N-MDN and R-MDN
are defined separately in Section 5.1.4. A list of chosen hyperparameters for each model
is given in Table 5.5.


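Table 5.5: Overview of chosen hyperparameters for each model in comparison.

Model        Hyperparameters
N-MDN        $d_{\mathrm{enc}} = 128$
             $N_{\mathrm{cpts}} = 5$
R-MDN        $d_{\mathrm{lstm}} = 256$
GAN          $d_{\mathrm{noise}} = 8$
             $d_{\mathrm{enc}} = 32$
             $d_{\mathrm{dec}} = 48$
             $d_{\mathrm{discr\_enc}} = 32$
             $d_{\mathrm{ff}} = 64$
VAE          $d_{\mathrm{lstm}} = 128$
             $d_{\mathrm{latent}} = 64$
             $d_{\mathrm{enc}} = 48$
             $d_{\mathrm{dec}} = 48$
Transformer  $d_{\mathrm{model}} = 64$
             $N_{\mathrm{heads}} = 1$
             $d_{\mathrm{ff}} = 256$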


General Information: All models are trained using a stochastic gradient
descent policy and the ADAM optimizer [Kin15], using either mini-batches or
the entire training dataset at once (whichever worked best for the respective
model). For prediction, $K = 300$ samples are used for the sample-based
predictors R-MDN, VAE and GAN.


5.1.3.5 Evaluation Methodology 


For achieving a reliable evaluation, a k-fold cross-validation is performed on
each dataset, in order to cope with unfavorable random training and test splits.
In the following, $k = 5$ folds are used, as this gives a good trade-off
between error bias and variance [Has09]. In compliance with the goal of
measuring the raw single dataset performance, all prediction models are re-trained
for each fold. As is common practice, prediction models are tasked to
predict $N_{\mathrm{pred}} = 12$ steps (4.8 seconds) into the future, given an observation of
$N_{\mathrm{obs}} = 8$ steps (3.2 seconds).
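
The fold construction can be sketched as follows. The helper `k_fold_splits` and its shuffling seed are hypothetical and only illustrate the general procedure; the exact splitting strategy used for the evaluation is not specified here.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for a k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)      # random assignment to folds
    folds = [indices[i::k] for i in range(k)]  # k roughly equally sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each trajectory index appears in exactly one test fold, and a model is re-trained from scratch on the corresponding training indices for every fold.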


For generating a maximum likelihood prediction, the outputs of the probabilistic
prediction models in comparison need to be processed in different ways.
For the R-MDN, instead of propagating a set of particles, the mean vector of
the highest weighted mixture component is fed back into the model in each
time step. As the GAN and VAE models generate a set of sample trajectories,
the mean position for each time step is used. Finally, for the N-MDN, the
mean curve of the N-Curve with the highest mixture weight is used.


For the NLL measure, which requires a probability density function
generated by each prediction model, sample-based output is processed by
applying a kernel density estimation [Sco18] using a Gaussian kernel in order
to obtain probability density functions for each time step.
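
A minimal sketch of this processing step is given below. It uses an isotropic Gaussian kernel with a fixed, hand-chosen bandwidth, which is an assumption; the bandwidth selection used in the evaluation is not stated here, and the function names are hypothetical.

```python
import math


def kde_nll(samples, target, bandwidth=1.0):
    """Negative log-density of `target` under an isotropic 2-D Gaussian KDE."""
    h2 = bandwidth ** 2
    log_norm = -math.log(2.0 * math.pi * h2) - math.log(len(samples))
    exponents = [-((sx - target[0]) ** 2 + (sy - target[1]) ** 2) / (2.0 * h2)
                 for sx, sy in samples]
    m = max(exponents)  # log-sum-exp keeps far-away targets numerically stable
    log_density = log_norm + m + math.log(sum(math.exp(e - m) for e in exponents))
    return -log_density


def trajectory_nll(sample_trajectories, truth, bandwidth=1.0):
    """Average the per-step KDE NLL over all predicted time steps."""
    per_step = [kde_nll([traj[t] for traj in sample_trajectories], truth[t], bandwidth)
                for t in range(len(truth))]
    return sum(per_step) / len(per_step)
```

If all samples are nearly identical and miss the ground truth, the exponent term dominates and the NLL becomes very large, while a set of samples that includes the ground truth position scores far lower.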


5.1.4 Quantitative Results 


For the quantitative evaluation, multiple output-related configurations are 
considered for the R-MDN and N-MDN models, controlling the number of 
mixture components and the N-MDN model’s output mode (see Section 
5.1.3.4). The configurations are depicted in Table 5.6. 

Table 5.6: Output-related configurations for the R-MDN and N-MDN models in the evaluation.

Configuration    Description
R-MDN$_\alpha$   Outputs a single-component mixture of Gaussians ($N_{\mathrm{comps}} = 1$).
R-MDN$_\beta$    Outputs a 3-component mixture of Gaussians ($N_{\mathrm{comps}} = 3$).
N-MDN$_\alpha$   Models the observed and future trajectory, and outputs a single
                 N-Curve ($N_{\mathrm{curves}} = 1$).
N-MDN$_\beta$    Models the observed and future trajectory, and outputs a mixture
                 of 3 N-Curves ($N_{\mathrm{curves}} = 3$).
N-MDN$_\gamma$   Models the future trajectory and outputs a single N-Curve
                 ($N_{\mathrm{curves}} = 1$).
N-MDN$_\delta$   Models the future trajectory and outputs a mixture of 3 N-Curves
                 ($N_{\mathrm{curves}} = 3$).


Tables 5.7 to 5.12 summarize the results of the quantitative evaluation, using
a per-dataset 5-fold cross-validation and the ADE, FDE and NLL performance
measures. Accordingly, the averaged performance values with the
corresponding standard deviation over all 5 folds are reported. It should be
noted that the performance values are not comparable across datasets, due
to different image and ground resolutions. In order to make the values
comparable, the datasets would need to be projected into 3-dimensional world space.
Additionally, a re-sampling of trajectory points can be necessary in order to
match motion profiles.

Table 5.7: Quantitative results of all approaches on the biwi:eth dataset for a prediction time
horizon of $N_{\mathrm{pred}} = 12$ time steps (4.8 seconds). ADE and FDE errors are reported in
pixels. Lower is better for all performance measures.

Model            ADE              FDE              NLL
Linear           21.80 ± 1.60     47.81 ± 5.41     -
Shotgun          -                -                10.07 ± 2.37
LSTM             13.44 ± 0.97     26.83 ± 2.52     -
Transformer      19.36 ± 2.16     35.81 ± 3.67     -
R-MDN$_\alpha$   26.12 ± 17.42    46.28 ± 32.50    16.62 ± 3.67
R-MDN$_\beta$    16.11 ± 3.70     28.70 ± 7.72     2893.14 ± 5760.97
VAE              22.33 ± 1.82     37.45 ± 4.91     8.38 ± 0.16
GAN              11.42 ± 2.18     22.26 ± 4.48     1084.14 ± 988.79
N-MDN$_\alpha$   9.51 ± 0.67      17.41 ± 1.13     7.25 ± 0.13
N-MDN$_\beta$    9.17 ± 1.28      17.15 ± 2.89     7.30 ± 0.30
N-MDN$_\gamma$   10.23 ± 1.07     18.61 ± 2.88     7.48 ± 0.41
N-MDN$_\delta$   9.87 ± 1.03      18.28 ± 3.40     7.27 ± 0.17


Table 5.8: Quantitative results of all approaches on the biwi:hotel dataset for a prediction time
horizon of $N_{\mathrm{pred}} = 12$ time steps (4.8 seconds). ADE and FDE errors are reported in
pixels. Lower is better for all performance measures.

Model            ADE              FDE              NLL
Linear           26.65 ± 1.24     51.13 ± 3.29     -
Shotgun          -                -                9.87 ± 1.14
LSTM             17.91 ± 2.16     32.93 ± 4.50     -
Transformer      20.42 ± 1.48     34.35 ± 2.95     -
R-MDN$_\alpha$   24.46 ± 4.92     43.52 ± 8.28     15.52 ± 2.93
R-MDN$_\beta$    19.01 ± 3.77     33.57 ± 6.71     12.20 ± 2.42
VAE              20.03 ± 4.14     35.61 ± 7.90     8.25 ± 0.31
GAN              15.48 ± 1.80     26.38 ± 3.15     20115.70 ± 38614.18
N-MDN$_\alpha$   16.64 ± 3.10     30.36 ± 7.56     8.26 ± 0.48
N-MDN$_\beta$    15.46 ± 2.03     27.28 ± 3.75     7.96 ± 0.16
N-MDN$_\gamma$   13.76 ± 0.94     23.82 ± 1.90     7.86 ± 0.18
N-MDN$_\delta$   15.30 ± 1.98     26.60 ± 4.22     7.85 ± 0.20

Table 5.9: Quantitative results of all approaches on the crowds:zara01 dataset for a prediction
time horizon of $N_{\mathrm{pred}} = 12$ time steps (4.8 seconds). ADE and FDE errors are reported
in pixels. Lower is better for all performance measures.

Model            ADE              FDE              NLL
Linear           21.69 ± 0.40     46.98 ± 1.39     -
Shotgun          -                -                10.93 ± 0.89
LSTM             23.71 ± 9.30     49.93 ± 24.18    -
Transformer      27.30 ± 1.64     50.04 ± 2.67     -
R-MDN$_\alpha$   20.15 ± 3.25     38.38 ± 6.77     14.76 ± 4.55
R-MDN$_\beta$    16.86 ± 0.34     31.18 ± 0.72     13.31 ± 2.26
VAE              19.21 ± 0.74     35.49 ± 1.47     8.24 ± 0.19
GAN              15.59 ± 0.41     30.11 ± 0.83     363.05 ± 456.18
N-MDN$_\alpha$   18.48 ± 0.51     35.97 ± 0.74     8.30 ± 0.03
N-MDN$_\beta$    19.07 ± 0.68     36.49 ± 1.40     8.21 ± 0.06
N-MDN$_\gamma$   16.60 ± 0.44     32.12 ± 0.84     8.04 ± 0.04
N-MDN$_\delta$   17.71 ± 0.43     34.17 ± 0.84     7.95 ± 0.05


Table 5.10: Quantitative results of all approaches on the crowds:zara02 dataset for a prediction
time horizon of $N_{\mathrm{pred}} = 12$ time steps (4.8 seconds). ADE and FDE errors are
reported in pixels. Lower is better for all performance measures.

Model            ADE              FDE              NLL
Linear           28.08 ± 0.33     60.83 ± 0.73     -
Shotgun          -                -                14.97 ± 1.72
LSTM             34.14 ± 22.66    72.47 ± 52.74    -
Transformer      32.17 ± 2.94     [illegible]      -
R-MDN$_\alpha$   24.92 ± 1.98     48.18 ± 3.59     11.99 ± 1.37
R-MDN$_\beta$    21.72 ± 2.59     41.19 ± 4.65     11.67 ± 1.17
VAE              23.44 ± 0.76     43.45 ± 1.42     9.29 ± 0.27
GAN              20.01 ± 0.72     39.57 ± 1.35     26.33 ± 13.40
N-MDN$_\alpha$   22.34 ± 0.92     [illegible]      8.54 ± 0.07
N-MDN$_\beta$    24.74 ± 1.30     47.29 ± 2.90     8.49 ± 0.06
N-MDN$_\gamma$   20.41 ± 1.07     40.39 ± 2.35     8.33 ± 0.07
N-MDN$_\delta$   21.41 ± 1.03     41.55 ± 2.45     7.98 ± 0.03

Table 5.11: Quantitative results of all approaches on the sdd:bookstore03 dataset for a prediction
time horizon of $N_{\mathrm{pred}} = 12$ time steps (4.8 seconds). ADE and FDE errors are
reported in pixels. Lower is better for all performance measures.

Model            ADE              FDE              NLL
Linear           28.39 ± 0.23     60.21 ± 0.67     -
Shotgun          -                -                18.03 ± 4.00
LSTM             27.91 ± 0.95     56.04 ± 1.75     -
Transformer      52.48 ± 14.88    97.00 ± 26.24    -
R-MDN$_\alpha$   50.02 ± 13.19    93.60 ± 23.37    14.28 ± 2.55
R-MDN$_\beta$    27.58 ± 4.64     50.11 ± 8.56     23.47 ± 23.88
VAE              30.75 ± 1.04     55.93 ± 1.83     8.88 ± 0.09
GAN              19.10 ± 0.58     35.45 ± 0.96     33.18 ± 4.71
N-MDN$_\alpha$   [illegible]      48.28 ± 2.80     9.13 ± 0.05
N-MDN$_\beta$    25.33 ± 2.34     48.69 ± 5.28     9.06 ± 0.09
N-MDN$_\gamma$   22.31 ± 1.15     41.60 ± 1.58     8.89 ± 0.07
N-MDN$_\delta$   19.43 ± 0.55     35.37 ± 1.10     8.46 ± 0.04


Table 5.12: Quantitative results of all approaches on the sdd:hyang00 dataset for a prediction
time horizon of $N_{\mathrm{pred}} = 12$ time steps (4.8 seconds). ADE and FDE errors are
reported in pixels. Lower is better for all performance measures.

Model            ADE              FDE              NLL
Linear           36.25 ± 0.79     75.93 ± 2.09     -
Shotgun          -                -                16.33 ± 1.30
LSTM             40.52 ± 9.44     85.83 ± 23.63    -
Transformer      119.70 ± 0.68    222.18 ± 1.39    -
R-MDN$_\alpha$   90.86 ± 23.89    168.85 ± 43.22   20.43 ± 4.03
R-MDN$_\beta$    44.18 ± 2.32     84.29 ± 4.17     13.46 ± 0.74
VAE              41.39 ± 2.53     82.21 ± 5.21     9.38 ± 0.09
GAN              28.84 ± 1.11     57.56 ± 2.92     20.48 ± 5.26
N-MDN$_\alpha$   35.83 ± 1.03     72.74 ± 2.49     9.82 ± 0.05
N-MDN$_\beta$    37.62 ± 2.57     76.31 ± 6.07     9.80 ± 0.07
N-MDN$_\gamma$   34.30 ± 1.11     69.87 ± 2.37     9.63 ± 0.02
N-MDN$_\delta$   29.68 ± 1.18     58.69 ± 3.13     9.11 ± 0.01

Model Comparison: In terms of the average and final displacement errors
(ADE and FDE), the N-MDN performs on par with the best
performing model in comparison, i.e. the GAN model. At the same time, the
N-MDN outperforms every other model in terms of the NLL performance
measure, with the VAE being the closest contender. It should be noted that
both the GAN and the R-MDN model have a tendency to perform worse in
terms of NLL, which can be attributed to these models' susceptibility to mode
collapse (see also Section 4.1). This is discussed in more detail in Section 5.1.5.
Among the N-MDN variants, the models only modeling the future trajectory
seem to outperform those also modeling the observed trajectory. Further,
using multiple components appears to be beneficial for more complex datasets.
This can be expected, as these datasets contain multiple decision points,
leading to multiple distinct possibilities for future trajectories. Lastly, the
Transformer model performs notably worse than the LSTM baseline, which
indicates that the model is not optimal for the specific task at hand in its original
form and thus may require some adaptations.


Baselines: As expected, the linear prediction model performs quite well in
terms of the average and final displacement error. This is due to a substantial
amount of (sub-)trajectories in each dataset, commonly around 50 to 60
percent [Hug21], representing a constant linear motion. Similarly, the shotgun
baseline is only outperformed by 2 out of 4 models, namely the VAE and the
N-MDN. Where it is outperformed, this is due to the baseline's inability to model
multiple modes, as required for more complex cases. Further, the variance of the prediction
is not adapted to the actual location of the observation in the scene, resulting
in under- and overestimation. More sophisticated prediction models not
suffering from mode collapse are thus able to outperform this baseline.


Summary: In summary, all probabilistic prediction models
perform similarly in terms of the presented performance measures, making the
choice of model dependent on their respective properties. In this case, the
N-MDN may be favored over other models due to it being fully regression-based
and thus more stable during training and inference, while at the same
time being less computationally heavy during inference.


5.1.5 Qualitative Evaluation


This section provides some insight into the probabilistic models' behavior and
the evaluation methodology itself. First, the problem of mode
collapse in R-MDN and GAN models is discussed. After that, a qualitative
comparison of the probabilistic models considered in this evaluation is
provided. Then, the quantitative evaluation of probabilistic prediction models is
discussed in more detail, focusing on how to measure probabilistic
prediction quality. Finally, different characteristics of the N-Curve model and its
N-MDN implementation as discussed in Sections 3.1 and 4.1 are further
investigated in the application context using real-world data.


5.1.5.1 R-MDN and GAN: Mode Collapse 


The quantitative evaluation revealed that in some cases the R-MDN and GAN
models yield large NLL values. While this can be caused by the
model generating bad predictions for certain inputs, it can often be
attributed to both models being vulnerable to mode collapse (MDN: [Mak19],
GAN: [Met17]), where the model outputs a narrow prediction due to only
generating slight variations of the same sample. To give a visual example of
the latter, Figure 5.10 depicts a well-spread prediction next to a bad prediction
and a prediction indicating a case of mode collapse. In this illustration,
exemplary predictions generated by a GAN trained on the biwi:eth
dataset are shown.




(a) NLL: 7.70 (b) NLL: 2154.72 (c) NLL: 7320.78 


Figure 5.10: Different predictions (blue) generated by a GAN trained on the biwi:eth dataset, 
yielding a well-spread (a) and a bad (b) prediction, as well as a prediction indicating 
a case of mode collapse (c). The observed trajectory is depicted in red. The negative 
log-likelihood (NLL) is provided for each predicted distribution given the ground 
truth trajectory depicted in green. 


Figure 5.10a gives an example of a spread-out prediction with noticeable bias,
which also covers the actual future trajectory. A common failure case is then
given in Figure 5.10b, where the model generates a prediction with increasing
uncertainty, but misses the actual future trajectory. Finally, Figure 5.10c provides
an example of a prediction that indicates a case of mode collapse. In this
example, all samples generated by the model are basically the same, with only
minimal variation. While the failure case in Figure 5.10b yields a significantly
increased negative log-likelihood, the low variance in the prediction depicted
in Figure 5.10c increases the error even further.


5.1.5.2 Comparison of Probabilistic Predictions 


For a qualitative comparison of the probabilistic prediction models, three 
examples are taken from the sdd:hyang00 dataset, as it provides a well- 
structured scenery, where pedestrians mainly stay on designated walking 
paths. These examples are depicted in Figure 5.11 and cover a range of 
situations with an increasing number of distinguishable possibilities for 
future trajectories. 

(a)   (b)   (c)


Figure 5.11: Examples (observation depicted with markers) taken from the sdd:hyang00 dataset, 
providing examples for a straight prediction and multi-modal predictions at differ- 
ent decision points, i.e. junctions, on the pathway. 


In the example depicted in Figure 5.11a, prediction of a straight motion is to be
expected, as there are no decision points after the observed portion of the
trajectory. Example 5.11b provides an observed trajectory that ends prior to a
decision point, where the observed person can either move straight or turn to
the right. Although, looking at the data, turning to the right is statistically less
likely, the observed trajectory shows a tendency of moving to the right,
making both options possible. Lastly, the example given in Figure 5.11c allows
for a trimodal prediction. In this case, however, the number
of modes in the prediction is highly dependent on the local neighborhood
of the observed trajectory considered during model training, as it influences
the target distribution. This is discussed in more detail in Section 5.1.5.3.


The predictions by each model for each example are depicted in Figures 5.12,
5.13 and 5.14. For the R-MDN and N-MDN models, the R-MDN$_\beta$ and
N-MDN$_\delta$ variants are used as representatives. Predictions are illustrated as a
heatmap calculated from predicted samples of each time step $t \in \{1, \dots, N_{\mathrm{pred}}\}$.

(a) R-MDN (b) VAE


(c) GAN (d) N-MDN 


Figure 5.12: Predictions generated by the probabilistic prediction models in comparison for an 
example trajectory (red markers) taken from the sdd:hyang00 dataset. The actual 
future trajectory is indicated as a dashed red line. 


For the first example, all models, with the exception of the VAE, generate a
unimodal prediction going straight, as expected. While the R-MDN and the
GAN generate comparable results, the N-MDN generates a higher-variance
prediction. The VAE, on the other hand, wrongly predicts another possibility
of moving downwards in addition to the straight prediction. This may be
caused by the close proximity of the absolute positions to the junction, where
moving down is another option. This, in turn, indicates that the model puts
more weight on the observed positions in isolation than on the context.

(c) GAN (d) N-MDN


Figure 5.13: Predictions generated by the probabilistic prediction models in comparison for an 
example trajectory (red markers) taken from the sdd:hyang00 dataset. The actual 
future trajectory is indicated as a dashed red line. 


The second example shows another form of mode collapse in the predictions 
of the R-MDN and the GAN, where the statistically less relevant mode is sup- 
pressed, thus the model’s prediction collapses onto a single mode. Again, 
the N-MDN and VAE models generate predictions with a higher variance, 
thereby also covering the possibility of turning to the right. Still, it is visible 
from the heatmap, that moving straight is the dominant option. 




(c) GAN (d) N-MDN 


Figure 5.14: Predictions generated by the probabilistic prediction models in comparison for an 
example trajectory (red markers) taken from the sdd:hyang00 dataset. The actual 
future trajectory is indicated as a dashed red line. 


Consistent with the two previous examples, the R-MDN and GAN generate a
similar bimodal prediction for the final example, both ignoring the
possibility of moving to the right. Ignoring this possibility could be attributed to the
observed trajectory being close to the left side of the pathway, making a
move to the right less likely. Combined with the rather low-variance
predictions of both models, hinting at a smaller surrounding area being considered
for the conditional prediction, trajectories located closer to the right side of
the pathway could not have had much influence on the models' output during
training. In contrast, the VAE and N-MDN models output a trimodal
prediction, where the third mode is more defined in the N-MDN's prediction.
At the same time, the VAE seems to over-estimate the pedestrian's movement
speed when going straight, while the N-MDN rather under-estimates
it slightly, when compared to the R-MDN and GAN predictions.


5.1.5.3 Assessing the Quality of Probabilistic Predictions 


In the quantitative evaluation section, the negative log-likelihood (NLL) has
been used as a measure of the quality of probabilistic predictions generated
by the R-MDN, VAE, GAN and N-MDN models. Although the NLL evaluates
a predictive distribution generated for a given observation using only a
single sample (the actual future trajectory), its application is justified under
the assumption that similar observations result in similar predictive
distributions, thus evaluating the entire distribution. At the same time, wrong or
superfluous modes in the predictive distribution are not penalized. This is
one of the reasons why models that generate distributions with higher variance
often score better. This is also the case for the oracle measure, as it
ignores all predictions that are not close to the ground truth [Paj18]. These
difficulties in assessing the quality of probabilistic predictions might be a reason
for the standard evaluation approach for trajectory prediction models leaving
out such a measure, even though most state-of-the-art models are capable of
generating probabilistic predictions. Apart from these difficulties, the NLL
provides a reliable measure for probabilistic predictions, as it does not require
the actual ground truth distribution to be known.


Nonetheless, it would be interesting to compare the probabilistic models
under a more sophisticated measure, using an estimation of the conditional
ground truth distribution. Thus, this section aims to provide a toy example
on a real-world dataset for evaluating the R-MDN, VAE, GAN and N-MDN
using the Wasserstein distance [Kol17]

$$W_p(P, Q) = \left( \inf_{\gamma \in \Gamma(P,Q)} \int \lVert x - y \rVert^p \, d\gamma(x, y) \right)^{\frac{1}{p}}, \tag{5.10}$$

where $P$ and $Q$ are probability distributions and $\Gamma(P,Q)$ is the set of all
joint distributions $\gamma(x, y)$ whose marginals are $P$ and $Q$, respectively. The
Wasserstein distance, originally formulated in the context of optimal
transport [Kan39], is preferred over the KL-Divergence [Kul51] and metrics built
upon it (e.g. the Jensen-Shannon distance [End03]), as it also takes the metric
space into account. As such, it considers the work required to transport the
probability mass from a given distribution to a target distribution. Because
of this intuition, it is also known as the Earth Mover's distance for $p = 1$.
While a closed form solution exists in the 1-dimensional case, there exists no
closed form solution for dimensions $d > 1$. In this case, a commonly used
approximation is given by the sliced Wasserstein distance [Bon15, Kol19].
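
The idea behind the sliced approximation can be sketched for two equally sized sets of 2-dimensional samples: both sets are projected onto random directions, and the 1-dimensional distances, which only require sorting, are averaged. The function below is a plain-Python illustration under these assumptions; the function name and the number of projections are hypothetical and do not reflect the settings used in the evaluation.

```python
import math
import random


def sliced_wasserstein(xs, ys, n_projections=100, p=2, rng=None):
    """Monte Carlo estimate of the sliced p-Wasserstein distance between two
    equally sized 2-D sample sets."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized sample sets")
    rng = rng or random.Random()
    total = 0.0
    for _ in range(n_projections):
        theta = rng.uniform(0.0, math.pi)  # random projection direction
        ux, uy = math.cos(theta), math.sin(theta)
        px = sorted(x * ux + y * uy for x, y in xs)
        py = sorted(x * ux + y * uy for x, y in ys)
        # In one dimension, the optimal transport plan matches sorted samples.
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(px)
    return (total / n_projections) ** (1.0 / p)
```

For identical sample sets the estimate is zero, and for a set shifted by a constant vector it is bounded by the length of that vector, since each projection only sees a part of the shift.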


The following toy example focuses on the evaluation of the endpoint
distribution $p(y_{N_{\mathrm{pred}}} \mid \cdot)$, $N_{\mathrm{pred}}$ steps into the future, generated by each probabilistic
prediction model, using the sliced Wasserstein distance. As a first step, the
conditional ground truth distribution needs to be determined for each
trajectory in the test dataset. This can be achieved by searching the training dataset
for trajectories that are similar to each test dataset trajectory in their
respective observed portion. The conditional ground truth distribution can then be
estimated by applying a Gaussian kernel density estimation to the
endpoints of similar training dataset trajectories. The steps required to determine
the conditional ground truth distribution for an exemplary test dataset
trajectory $X^* \in \mathcal{D}_{\mathrm{test}}$ (Figure 5.15a) are depicted in Figure 5.15.
Figure 5.15: Example for estimating a conditional ground truth probability distribution of a
trajectory endpoint given a test trajectory's first 8 points as observation: (a) exemplary
test trajectory, (b) determined search box, (c) set of similar trajectories, (d) estimated
endpoint distribution. The exemplary test trajectory in this figure starts at the black
circular marker. The observed portion of the exemplary test trajectory ends prior to the
junction, making a probabilistic prediction of its true endpoint potentially multi-modal.


For finding similar trajectories of $X_i = \{x_1^i, ..., x_{N_{obs}+N_{pred}}^i\}$, an
axis-aligned rectangular search region around the test trajectory's first point
$x_1^i$ is determined. While the longitudinal expansion $e_{long}$ is calculated to
include the first 3 trajectory points, the lateral expansion $e_{lat}$ considers the
width of the walking path and is set by hand. This assumes that there is no bias
in the conditional ground truth distribution if the observed trajectory is closer
to either side of the walking path. The resulting search region is depicted in
Figure 5.15b.


As a next step, all training dataset trajectories starting within this region are
gathered. From this set of trajectories, only those complying with the general
movement direction

$$m^i = \frac{1}{N_{obs}} \sum_{t=2}^{N_{obs}} \left( x_t^i - x_{t-1}^i \right) \tag{5.11}$$

and speed

$$s^i = \frac{1}{N_{obs}} \sum_{t=2}^{N_{obs}} \lVert x_t^i - x_{t-1}^i \rVert \tag{5.12}$$

of the observed portion of the test trajectory are kept for the distribution
estimation. Here, a movement direction deviation $\theta_{dir}$ of 10° and a speed
deviation $\Delta_s$ of 25% are allowed. The resulting set of similar trajectories
$\mathcal{D}_{sim}$ is depicted in Figure 5.15c. Finally, the conditional endpoint
distribution estimated from the set of similar trajectories is illustrated in
Figure 5.15d.
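
The selection steps described above can be sketched as follows. This is a
simplified illustration and not the code used for the evaluation: the search box
is assumed to be symmetric around the first point with hand-set half extents,
trajectories are lists of 2D points, and all function and parameter names are
hypothetical. The direction and speed follow Equations 5.11 and 5.12 up to the
normalization constant, which cancels in both comparisons.

```python
import math

def mean_step(points):
    """Mean displacement vector between consecutive points (cf. Eq. 5.11)."""
    n = len(points)
    sx = sum(points[t][0] - points[t - 1][0] for t in range(1, n))
    sy = sum(points[t][1] - points[t - 1][1] for t in range(1, n))
    return sx / n, sy / n

def mean_speed(points):
    """Mean step length between consecutive points (cf. Eq. 5.12)."""
    n = len(points)
    return sum(math.dist(points[t], points[t - 1]) for t in range(1, n)) / n

def angle_between_deg(u, v):
    """Unsigned angle between two 2D vectors in degrees."""
    cross = u[0] * v[1] - u[1] * v[0]
    dot = u[0] * v[0] + u[1] * v[1]
    return abs(math.degrees(math.atan2(cross, dot)))

def find_similar(test_obs, train_trajs, half_extent, n_obs=8,
                 max_angle_deg=10.0, max_speed_dev=0.25):
    """Return the training trajectories resembling the observed test portion."""
    x0, y0 = test_obs[0]
    ref_dir, ref_speed = mean_step(test_obs), mean_speed(test_obs)
    similar = []
    for traj in train_trajs:
        sx, sy = traj[0]
        # 1) the start point must lie inside the axis-aligned search box
        if abs(sx - x0) > half_extent[0] or abs(sy - y0) > half_extent[1]:
            continue
        obs = traj[:n_obs]
        # 2) the movement direction must stay within the angular deviation
        if angle_between_deg(mean_step(obs), ref_dir) > max_angle_deg:
            continue
        # 3) the speed must stay within the relative speed deviation
        if abs(mean_speed(obs) - ref_speed) > max_speed_dev * ref_speed:
            continue
        similar.append(traj)
    return similar
```

The endpoints of the returned trajectories, i.e. `traj[-1]`, would then serve as
the samples for the kernel density estimation.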


It has to be noted that the resulting probability distribution is highly
dependent on the considered local neighborhood defined by $e_{long}$ and $e_{lat}$ and
the deviation parameters $\theta_{dir}$ and $\Delta_s$. At the same time, it is not
clear how to choose these values properly. As such, the assumptions made for this
example might be inaccurate. Further, it is even harder to define these parameters
for less structured datasets, making such an evaluation non-viable for large-scale
evaluations including several datasets. In addition, another obstacle is the
required number of trajectories similar to an observed trajectory in question.
This is touched upon in more detail later in this section.


Aiming at a comparison of endpoint probability distributions, Figure 5.16 de- 
picts sample-based predictions for the endpoint as generated by the sample- 
based prediction models. 
Figure 5.16: Sample-based endpoint predictions as generated by the R-MDN, VAE and GAN
models. Regions of high sample density can be interpreted as modes in the actual
probability distribution, whereas the sample spread indicates the variance.


As described before, a probability density function is estimated from these 
sample-based predictions by applying a Gaussian kernel density estimation. 
The resulting probability densities, including the one generated by the N- 
MDN, are depicted in Figure 5.17. In addition to the probability densities, 
respective NLL scores given $X_i$ and Wasserstein distances given the estimated
ground truth distribution (see Figure 5.15d) are provided. 
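
A Gaussian kernel density estimation over 2D endpoint samples can be sketched as
follows. The isotropic kernel and the Scott-style rule of thumb for the default
bandwidth are assumptions made for this illustration; the bandwidth selection
used for the evaluation is not specified here, and the function name is
hypothetical. The sketch expects at least two samples that are not all identical.

```python
import math

def gaussian_kde_2d(samples, bandwidth=None):
    """Return a density function estimated from 2D samples with an isotropic
    Gaussian kernel."""
    n = len(samples)
    if bandwidth is None:
        mx = sum(s[0] for s in samples) / n
        my = sum(s[1] for s in samples) / n
        # per-axis variance, averaged over both axes
        var = sum((s[0] - mx) ** 2 + (s[1] - my) ** 2
                  for s in samples) / (2 * (n - 1))
        # Scott-style factor n^(-1/(d+4)) with d = 2
        bandwidth = math.sqrt(var) * n ** (-1.0 / 6.0)
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * n)

    def density(x, y):
        return norm * sum(
            math.exp(-((x - sx) ** 2 + (y - sy) ** 2) / (2.0 * bandwidth ** 2))
            for sx, sy in samples)

    return density
```

Calling `gaussian_kde_2d(endpoints)(x, y)` then evaluates the estimated density
at a query location.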
Figure 5.17: Predicted endpoint probability distributions as generated by the R-MDN, VAE, GAN
and N-MDN models: (a) R-MDN (NLL: 124.65, Wasserstein: 116.66), (b) VAE (NLL: 16.61,
Wasserstein: 127.24), (c) GAN (NLL: 200.31, Wasserstein: 144.21), (d) N-MDN (NLL: 11.46,
Wasserstein: 91.10). The NLL and Wasserstein distance values between each respective
predicted distribution and an estimated ground truth distribution (see Figure 5.15d)
are provided.


Following this example on how to calculate the Wasserstein distance for an
exemplary trajectory, the same methodology is applied to the first fold test
dataset of the sdd:hyang00 dataset. For the evaluation on the whole dataset,
a few things have to be noted. First, in practice*, the Wasserstein distance is
calculated on a sample-based representation of the provided probability
distributions. Thus, the actual sample-based representations are used for the
R-MDN, VAE and GAN models, and an equal number of samples is drawn from the
distribution generated by the N-MDN. In contrast, the estimated ground truth
distribution is not re-sampled to obtain a larger number of samples, so as not to
distort the actual distribution. Further, test trajectories
$X_i \in \mathcal{D}_{test}$ are only considered if at least 30 similar trajectories
are available, i.e. $|\mathcal{D}_{sim}| \geq 30$, according to the aforementioned
method. According to [Sil86], at least 19 samples are required to calculate an
accurate estimate of a bivariate Gaussian density using a kernel density
estimation. As the ground truth distributions in this evaluation are potentially
multi-modal, the number of samples should be increased. At the same time,
increasing the number of required samples potentially reduces the number of
available test trajectories, as not enough similar trajectories may be available.
Using $|\mathcal{D}_{sim}| \geq 30$, the size of the test dataset is reduced by
approximately 45%, thus providing a trade-off between obtaining an accurate
ground truth distribution and a reasonable test set size.


Following this, Table 5.13 depicts the mean Wasserstein distance calculated 
using the sdd:hyang00’s first fold test dataset. For comparison, the NLL, as 
calculated for the quantitative evaluation (see Section 5.1.4), is provided. For 
completeness, the Wasserstein distance is also provided for the shotgun base- 
line. 


* In the context of this thesis, the implementation provided by the Python Optimal Transport li- 
brary [Fla21] is used, which computes a Monte Carlo approximation of the 2-sliced Wasserstein 
distance. 
Table 5.13: Negative Log-Likelihood and Wasserstein distance calculated on the sdd:hyang00's
first fold test dataset for each probabilistic prediction model in comparison. In case
of the Wasserstein distance, the predicted endpoint distribution is compared with an
estimation of the true endpoint distribution for each test trajectory. Lower is better
for both measures.


Model      NLL      Wasserstein

Shotgun    14.97    68.04
R-MDN      14.67    81.25
VAE         9.41    62.68
GAN        30.36    61.47
N-MDN       9.12    51.62


Looking at the results, the ranking of the probabilistic models is, in parts,
consistent with the NLL-based ranking. The shotgun baseline still performs
well in this toy example, which is probably due to the presence of many cases
where a unimodal prediction is sufficient. This also supports the shotgun
approach's viability as a baseline for probabilistic trajectory prediction.
Besides that, a major difference is the GAN performing notably better under
the Wasserstein distance. This can likely be attributed to the Wasserstein
distance which, unlike the NLL, does not penalize lower variance predictions.
Still, the N-MDN outperforms the other models in terms of the quality of the
probabilistic prediction. Further, the results appear to be more stable under
the Wasserstein distance.


In summary, this toy example supports the viability of the proposed N-Curve
approach in the context of human trajectory prediction. Further, it suggests
that the NLL can serve as a viable performance measure for probabilistic
prediction, but needs to be accompanied by a qualitative evaluation in order
to assess whether the predictions are reasonable in terms of their variance.
Finally, in cases where the ground truth data distribution is available,
e.g. when using synthetically generated datasets, the Wasserstein distance
may be preferred over the NLL.
5.1.5.4 N-MDN: Additional Examples


This section focuses on a few real-world examples, addressing different
characteristics of the N-Curve model and its N-MDN implementation, namely the
suppression of superfluous mixture components (see Section 4.1.3), modeling
different speeds using multiple mixture components and the squeezing effect
(see Section 3.1.2). In addition, the most common failure case occurring when
using the N-MDN is presented.


Starting off with superfluous component suppression, the prediction of a
3-component N-MDN for an exemplary trajectory taken from the biwi:eth dataset is
depicted in Figure 5.18. In this example, it can be seen that two components
have been suppressed in the model's output by assigning them a weight of
$\pi_k \approx 0$, leaving a single component for the prediction. This complies with
the desired behavior in this situation, as all persons moving towards the
university come together at the entrance. Additionally, all persons in the dataset
move with the same speed on average, making multi-modal prediction only
necessary in situations with multiple distinct possible future trajectories.
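
When post-processing such a prediction, suppressed components can be identified
by thresholding the mixture weights. The following sketch shows one way to do
this; the threshold value and the function names are hypothetical choices for
illustration and are not part of the N-MDN itself.

```python
def active_components(weights, threshold=1e-3):
    """Indices of mixture components whose weight is not suppressed."""
    return [k for k, w in enumerate(weights) if w > threshold]

def prune_mixture(weights, components, threshold=1e-3):
    """Drop suppressed components and renormalize the remaining weights."""
    keep = active_components(weights, threshold)
    total = sum(weights[k] for k in keep)
    return [weights[k] / total for k in keep], [components[k] for k in keep]
```

For a 3-component prediction with two suppressed components, `prune_mixture`
returns a single component with weight 1.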


Figure 5.18: Exemplary prediction of a 3-component N-MDN, where 2 N-Curves were sup- 
pressed in favor of a single N-Curve responsible for the prediction. 




On rare occasions, a low-weight, poorly optimized component appears in a
generated prediction. This is the most common failure case when using the
N-MDN and is closely connected to the presence of superfluous components. An
example of this, taken from the crowds:zara02 dataset, is given in Figure 5.19.
In this example, the component depicted in green unexpectedly branches out
and slows down considerably. While each of these actions is plausible in the
presence of nearby pedestrians as part of a collision avoidance behavior,
their combination is more likely an optimization artifact, where a mixture
component no longer receives support from training samples from some point
onwards during training. Although the model is not exposed to specific
multi-agent data, isolated trajectories still reflect this behavior.


Figure 5.19: Exemplary prediction of a 3-component N-MDN revealing a common failure case
of the prediction including a low-weighted, poorly optimized mixture component
(green).


Besides using multiple mixture components in a prediction for modeling
distinct future trajectories, these components can also be used to model similar
future trajectories at different speeds. Figure 5.20 gives an example taken
from the crowds:zara02 dataset, where the N-MDN uses all 3 of its mixture
components for modeling different speed versions of the same future trajectory
in terms of its movement direction and curvature. As mentioned before, this
dataset has a higher person density than, for example, the biwi datasets, so
pedestrians may be more likely to deviate from the average movement speed in
order to prevent collisions with other pedestrians. This, in turn, likely causes
the multi-modal prediction covering different speeds.


Figure 5.20: Exemplary prediction of a 3-component N-MDN, where each N-Curve in the
mixture models another version of the same trajectory, using a different movement
speed.


Finally, the open questions concluding Section 3.1.2 include whether the
squeezing effect is relevant in real-world situations and whether the model is
able to generate constant variance predictions when learned from data. Overall,
the squeezing effect can be rated as not relevant when using real-world data,
which is generally subject to noise. In this case, when predicting into the
future, the variance usually increases with each time step. With respect to the
constant variance case, Figure 5.21 provides an example taken from the
sdd:hyang00 dataset, where the N-MDN outputs a unimodal prediction that maintains
almost constant lateral variance. In this example, this is achieved by slowly
morphing an almost circular covariance ellipse towards a covariance ellipse
with increased longitudinal variance. Here, the increase of the longitudinal
variance is a way of coping with uncertainties about the actual speed of the
observed trajectory. Both observations, an increasing longitudinal variance over
time and a near constant lateral variance, indicate the low relevance of the
squeezing effect for real-world applications.


Figure 5.21: Exemplary prediction of an N-MDN maintaining a near constant lateral variance
while increasing longitudinal variance at the same time.


5.1.5.5 Implicit Input Attention 


With the prediction generated by an N-MDN being based on just the $N_{obs}$
observed positions, this section investigates the influence of each input on the
generated prediction, with respect to their position within the observed
sequence. Recall that an N-MDN prediction is given in terms of a curve weight
distribution $\pi$, as well as a set of control point mean vectors $\mu$ and
covariance matrices $\Sigma$, which yield the predicted mean curve and the region of
uncertainty around it. Following this, it is especially interesting to see if
different parts of a given input sequence are considered for generating $\pi$,
$\mu$ and $\Sigma$.




As there is no attention mechanism explicitly built into the N-MDN
architecture, the model's attention to different parts of a given sequence can be
calculated using the gradient of each generated output with respect to the
inputs. In PyTorch, the autograd module can be used for this, which calculates
the respective gradients by performing a backward pass through the network
given the generated output. Figure 5.22 depicts the resulting gradient-based
implicit attention maps for each dataset in the evaluation. For each dataset,
gradient magnitudes are averaged for each output, i.e. $\pi$, $\mu$ and $\Sigma$. As in
previous sections, the N-MDNg variant is considered. 
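
The gradient-based attention computation can be illustrated without a deep
learning framework. The sketch below approximates the gradient of the model
outputs with respect to each input time step by central finite differences,
whereas the evaluation itself relies on a backward pass through the network. The
function names and the toy model are hypothetical stand-ins; the toy model does
not represent the N-MDN.

```python
import math

def input_attention(model, sequence, eps=1e-5):
    """Gradient magnitude of the model outputs w.r.t. each input time step,
    approximated by central finite differences and normalized to sum to 1."""
    scores = []
    for t, point in enumerate(sequence):
        sq_sum = 0.0
        for d in range(len(point)):
            plus = [list(p) for p in sequence]
            minus = [list(p) for p in sequence]
            plus[t][d] += eps
            minus[t][d] -= eps
            out_p, out_m = model(plus), model(minus)
            sq_sum += sum(((a - b) / (2.0 * eps)) ** 2
                          for a, b in zip(out_p, out_m))
        scores.append(math.sqrt(sq_sum))
    total = sum(scores)
    return [s / total for s in scores] if total > 0.0 else scores

def toy_model(seq):
    """Hypothetical stand-in for a network output: linear extrapolation
    from the last two observed points."""
    (x1, y1), (x2, y2) = seq[-2], seq[-1]
    return [2.0 * x2 - x1, 2.0 * y2 - y1]
```

Averaging the resulting per-time-step scores over many input sequences yields
one row of a heatmap of the kind described above.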


Figures 5.22a, 5.22b and 5.22c depict the input attention on a per dataset
basis for each of the model outputs separately. Figure 5.22a reveals that for
generating the mean vectors $\mu$, the most important input is given by the last
observed position with an additional, but weaker, contribution by the second
last element. This observation is in line with the findings given in [Sch20a].
In contrast, for determining the weights and covariance matrices, a mix
of multiple observations spread across the entire observed sequence is
considered. The choice of which observations to rely on varies between datasets.
This is most likely due to random effects during training and the model being
trained for each dataset individually. Especially in the case of the covariance
matrix, it makes sense to incorporate multiple observations from a given
sequence, as a noise estimation can be expected to be more accurate using more
data samples. Figure 5.22d depicts the input attention for each model output
averaged over all datasets and summarizes the aforementioned findings.
Figure 5.22: Heatmap visualization of the model's attention to each of the $N_{obs} = 8$
observations when generating predictions in terms of curve weights $\pi$, mean vectors
$\mu$ and covariance matrices $\Sigma$: (a) mean vectors $\mu$, (b) covariance matrices
$\Sigma$, (c) weighting $\pi$, (d) overall average. Time steps t are given relative to
the last observation at t = 0. Inputs with no influence on the output are depicted in
white and inputs with the most influence (per row) are given in dark blue. In figures
(a) - (c), the datasets biwi:eth, biwi:hotel, crowds:zara01, crowds:zara02,
sdd:bookstore03 and sdd:hyang00 are depicted along the y-axis (a - f).


To accompany the heatmap visualizations, Figure 5.23 depicts the influence 
of each observation on respective mean vectors and covariance matrices for 
two exemplary input sequences. Both examples support the observation that
for generating the mean vectors, the most recent observations are the most 
important. Further, the covariance matrices are determined using several ob- 
servations spread across the observed sequence. 
Figure 5.23: Visualization of the influence of observations along a given input sequence for two
exemplary sequences. More important observations for determining the mean vectors
(left) and covariance matrices (right) are depicted with higher color intensity.


5.1.6 Summary 


In summary, this section gave a detailed overview of the human trajectory 
prediction task, commonly used datasets and state-of-the-art prediction mod- 
els. The latter are most commonly variants of Recurrent Mixture Density 
Networks, Variational Autoencoders and Generative Adversarial Networks. 
This overview was followed by an extensive evaluation of the N-Curve model 
in comparison to these commonly used models, using different performance
measures and corresponding baselines. The performance measures include 
the average and final displacement error for measuring the performance of 
maximum likelihood predictions, as well as the negative log likelihood for 
assessing the probabilistic prediction performance. In this evaluation, the N- 
Curve model shows competitive results, outperforming most other generic 
probabilistic sequence models in the comparison. 


5.2 Human Motion Prediction 


The primary goal of this section is the evaluation of the scalability of the N- 
Curve model to higher-dimensional data. For this, the task of human motion 
prediction is considered. Note that in the literature, human trajectory predic- 
tion (see the previous Section 5.1) is sometimes confused with human motion 
prediction. To clarify, human trajectory prediction is concerned with human 
movement along a trajectory through an observed scene based on observed 
2- or 3-dimensional locations. In contrast, human motion prediction targets
the motion of the human body when performing different actions and is
based on sequences of human poses.


Thus, in human motion prediction, a prediction model is tasked to generate a 
sequence of human poses resembling some action performed by an observed 
subject. The generation is thereby conditioned on a given initial observation 
of the performed action. Each element in the sequences to process is given by 
a human pose. Such human poses are commonly represented as a set of 3D 
joint positions, which can be connected via a skeleton definition. The number 
of 3D joints describing a human pose varies between datasets. To give an 
example, in the Human3.6m dataset [Ion13], a human pose is described by 32 
3D joints, yielding a 96-dimensional vector. 
5.2.1 Datasets


Looking at datasets which provide human pose sequences, the most commonly
used ones include the CMU mocap database*, the Human3.6m dataset
(abbrev.: h3.6m, [Ion13]) and the NTU RGB+D dataset [Sha16]. Among these
datasets, the h3.6m dataset is the most widely used for the task of human
motion prediction. This is due to the existence of a standard evaluation
protocol, which allows re-using previous results of different approaches
tackling the prediction task. Some details on the h3.6m dataset are given in
Table 5.14. Figure 5.24 depicts an example of a pose sequence for the walking
action taken from the h3.6m dataset.


Table 5.14: Details of the Human3.6m dataset.


The Human3.6m dataset

Number of subjects     11
Pose representation    3D joint positions and angles (32 joints)
Recorded actions       directions, discussion, eating, seated activities, greeting,
                       taking photo, posing, making purchases, smoking, waiting,
                       walking, sitting on chair, talking on the phone, walking dog,
                       walking together
Sample rate            50Hz




Figure 5.24: Exemplary human motion sequence taken from the Human3.6m dataset. The left 
arm and leg are depicted in blue. For illustration purposes, the sample rate is re- 
duced to 12.5Hz. 


* http://mocap.cs.cmu.edu/ 
5.2.2 Evaluation Protocol


For enabling a repeatable and comparable evaluation, approaches presented in 
the literature commonly follow the standard evaluation protocol provided in 
[Fra15] and [Jai16]. According to this protocol, multiple data pre-processing 
steps are performed prior to training and evaluation. First, the pose represen- 
tation as provided in the h3.6m dataset is converted into an exponential map 
representation of each joint using a specific pre-processing of global transla- 
tion and rotation as specified in [Tay07]. Following the change in representa- 
tion, the data is standardized by subtraction of the mean and division by the 
standard deviation along each dimension. Then, dimensions with constant 
values are dropped from the representation. The resulting pose representa- 
tion then consists of 17 joints and a global translation component, yielding a 
54-dimensional representation. Finally, the sequence sample rate is reduced 
to 25Hz. 
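
The standardization, removal of constant dimensions and sample rate reduction can
be sketched as follows. The conversion to the exponential map representation is
omitted, and the statistics are computed on the given frames only for brevity,
which is an assumption of this sketch. The function name and the tolerance `eps`
are hypothetical.

```python
import math

def standardize_and_reduce(frames, eps=1e-8, stride=2):
    """Standardize each dimension, drop constant dimensions and subsample.

    frames: list of equally long lists of floats (one pose vector per step).
    Returns the processed frames and the indices of the kept dimensions.
    """
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dim)]
    # constant dimensions have zero standard deviation and are dropped,
    # which also avoids a division by zero during standardization
    keep = [d for d in range(dim) if stds[d] > eps]
    processed = [[(f[d] - means[d]) / stds[d] for d in keep] for f in frames]
    # a stride of 2 halves the sample rate, e.g. from 50 Hz to 25 Hz
    return processed[::stride], keep
```

The returned index list allows the dropped dimensions to be re-inserted when
converting predictions back to the full pose representation.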


Using the pre-processed data, training is performed on a subset of actions us- 
ing subjects S1, S6, S7, S8, S9 and S11. The action subset includes walking, eat- 
ing, smoking, discussion, directions, greeting, phoning, posing, purchases, sitting, 
sittingdown, takingphoto, waiting, walkingdog, walkingtogether. The test data- 
set then contains actions performed by subject S5, collecting 8 sub-sequences 
of specific actions using a fixed seed. The considered set of actions in the 
test dataset is restricted to the representative actions walking, eating,
smoking and discussion. For prediction, a given model is tasked to predict up to
$N_{pred} = 10$ time steps (400 milliseconds) into the future, given an
observation of $N_{obs} = 50$ time steps (2 seconds) of a given action. The
prediction performance is then measured in terms of the mean angle error*

$$\mathrm{MAE} = \frac{1}{M} \sum_{i=1}^{M} \lVert \hat{y}_i - y_i \rVert_2, \tag{5.13}$$

calculated after 80, 160, 320 and 400 milliseconds, using M samples of the
same action. With a sample rate of 25Hz, this corresponds to t = 2, t = 4,
t = 8 and t = 10 time steps. This restriction to short-term prediction* is
introduced due to the stochasticity of human motion preventing a quantitative
evaluation of longer time horizons [Fra15].


* Using an Euler angle representation, which can be calculated from the exponential map
representation.
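
The error computation at the selected time steps can be sketched as follows,
assuming that the error per sample is the Euclidean distance between the
predicted and the ground truth Euler angle vectors. The function name and the
data layout are hypothetical.

```python
import math

def mean_angle_error(predictions, targets, steps=(2, 4, 8, 10)):
    """Mean Euclidean angle error over M sequences at selected time steps.

    predictions, targets: M sequences, each a list of per-step angle vectors.
    steps: 1-based prediction steps; at 25 Hz the steps 2, 4, 8 and 10
    correspond to 80, 160, 320 and 400 milliseconds.
    """
    m = len(predictions)
    errors = {}
    for t in steps:
        errors[t] = sum(math.dist(pred[t - 1], true[t - 1])
                        for pred, true in zip(predictions, targets)) / m
    return errors
```

The result maps each evaluated time step to its mean error, which corresponds to
one row of a results table.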


5.2.3 Baselines and Comparison Models 


For comparison, there are several commonly used simple and neural
network-based baselines. Common simple baselines are given by the Zero-velocity
model [Mar17], which constantly predicts the last observation, and a running
average approach over the last n observed poses. The running average approach
will be abbreviated as Run. avg. n. Regarding neural network-based baselines,
the most prevalent models include the LSTM-3LR [Fra15], the ERD [Fra15] and the
SRNN [Jai16] models. While the LSTM-3LR is a three-layer LSTM network, the ERD
and SRNN models are more tailored towards learning a meaningful representation
of a given observation to base their prediction on. The Encoder-Recurrent-Decoder
model (abbrev.: ERD) is a type of RNN that combines representation learning with
learning temporal dynamics. To achieve this, the input to the RNN is encoded
into a representation in which learning pose dynamics is easier. The Structural
RNN (abbrev.: SRNN), on the other hand, aims to incorporate semantic knowledge
about the data structure into the model architecture. Based on the fact that a
sequence of poses can be represented by a (manually designed) spatio-temporal
graph, the SRNN provides an approach for transforming such a graph into a
feedforward mixture of RNNs.
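
The two simple baselines can be sketched in a few lines. For the running
average, the sketch assumes that the mean of the last n observed poses is
repeated for every predicted time step; whether the average is instead rolled
forward over the predictions is not specified here. Function names are
hypothetical.

```python
def zero_velocity(observed, n_pred):
    """Repeat the last observed pose for every predicted time step."""
    return [list(observed[-1]) for _ in range(n_pred)]

def running_average(observed, n_pred, n=2):
    """Predict the mean of the last n observed poses for every time step."""
    window = observed[-n:]
    mean_pose = [sum(values) / len(window) for values in zip(*window)]
    return [list(mean_pose) for _ in range(n_pred)]
```

Both functions take a list of pose vectors and return `n_pred` predicted poses.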


Beyond these common baselines, recent approaches to human motion prediction
are commonly based on either Recurrent Neural Networks (e.g.
[Gho17, Gop19]), (sequence-to-sequence) Generative Adversarial Networks
(e.g. [Gui18, Kun19]) or Graph Neural Networks (abbrev.: GNN, e.g. [Mao19,
Li20]). The latter thereby consider the actual configuration of the joints
according to the underlying skeleton. For the following quantitative evaluation,
representatives for each base architecture are selected.


* Short-term prediction is defined as predicting less than 560ms into the future.


In the group of Recurrent Neural Networks, besides the baselines presented 
above, another interesting approach is given by the QuaterNet model [Pav18]. 
As opposed to the other approaches in this comparison, which regress joint ro- 
tations using the exponential map representation, the QuaterNet model uses 
a quaternion-based representation of joint rotations. This change in repre- 
sentation targets the issue of discontinuities, which can occur when using 
an exponential map representation. Further, joint position errors are consid- 
ered in the training loss function, trying to incorporate the varying impact 
of joints on the pose. 


Looking at the GAN-based approaches, the Adversarial Geometry-aware
Encoder-Decoder (abbrev.: AGED, [Gui18]) and Bidirectional 3D Human Motion
Prediction GAN (abbrev.: BiHMP-GAN, [Kun19]) models are considered
in the quantitative evaluation. Both of these models rely on a seq2seq RNN
which is embedded in an adversarial training approach. The BiHMP-GAN
model, on the one hand, incorporates a pose embedding, comparable to the
ERD, and uses a bidirectional RNN architecture [Sch97] in its discriminator
network. On the other hand, the AGED model exploits the intrinsic geometric
structure of 3D rotations during training of the generator by using a geodesic
distance between joint rotations. This is in contrast to the common approach of
using a Euclidean distance between predicted and ground truth joint angles.


Finally, among GNN-based approaches, the Traj-GCN [Mao19] and the Ad- 
versarial GCN (abbrev.: A-GCN, [Cui20]) models are included in the evalua- 
tion. Both models are based on Graph Convolutional Networks (abbrev.: GCN, 
[Kip17]) and thus encode spatial dependencies in human poses by treating a 
pose as a generic graph. The Traj-GCN model proposes to work in trajectory 
space instead of the traditionally used pose space, in order to encode tem- 
poral information. Further, graph connectivity is learned automatically dur- 
ing training. Similar to the second aspect, the A-GCN learns the connection 
strength between nodes in the graph. Following this, poses are represented as 
a dynamic graph, where natural connections between joint pairs are exploited 
explicitly. Beyond that, links between geometrically separated joints can be
learned implicitly. Using an adversarial training approach, the A-GCN could 
also be put into the group of GAN-based approaches, thus blurring the line 
between the groups. An overview of the presented baseline and comparison 
models is depicted in Table 5.15. 


Table 5.15: Overview of the baseline and comparison models considered in this evaluation.


Model                    Type               Year

Zero-velocity [Mar17]    Simple baseline    2017
Run. avg. n [Mar17]      Simple baseline    2017
LSTM-3LR [Fra15]         Neural baseline    2015
ERD [Fra15]              Neural baseline    2015
SRNN [Jai16]             Neural baseline    2016
QuaterNet [Pav18]        RNN                2018
AGED [Gui18]             GAN                2017
BiHMP-GAN [Kun19]        GAN                2019
Traj-GCN [Mao19]         GNN                2019
A-GCN [Cui20]            GNN                2020


5.2.4 N-Curve Model Setup 


Following the common approach of processing pose sequences in an exponen- 
tial map representation, training and prediction in the N-MDN will be based 
on this representation. With a focus on scalability, the generic version of the 
model is used as in the human trajectory prediction evaluation (Section 5.1). 
Therefore, model extensions tailoring the model towards the task of human 
motion prediction are disregarded. Further, the use of a more domain-specific
loss function, i.e. the geodesic loss function proposed in [Gui18], is also not
considered. This is due to the fact that it cannot be easily integrated into
the log-likelihood loss function for learning the mean vectors and covariance
matrices jointly.


For the evaluation, two variants of the N-MDN generating unimodal predictions
are employed. For the first variant, denoted as N-MDN$_\alpha$, the
hyperparameters (see also Figure 5.5) are set as $d_{enc} = 1024$,
$N_{curves} = 1$ and $N_{cpts} = 4$. A 4-layer LSTM is used as the sequence
encoder. Further, the N-MDN is parameterized to generate diagonal covariance
matrices only. This is common practice, as covariance estimation becomes more
difficult in higher-dimensional data [Ha18, Raz20]. Mixture Density Networks are
especially affected by this, as the estimation of higher-dimensional covariance
matrices contributes to numerical instabilities [Rup17, Mak19]. Still, with the
N-MDN processing full pose representations, it can be expected that dependencies
between dimensions are captured implicitly, even though the generated N-Curve
only provides diagonal covariance matrices. In order to provide more expressive
covariance matrices, an additional variant of the N-MDN is evaluated. This
variant is denoted as N-MDN$_\beta$ and generates N-Curves with sparse covariance
matrices, which model intra-joint correlations and the correlations between the
dimensions of the global translation. With each joint and the global translation
being represented by a 3-dimensional (sub-)vector within the pose representation,
the resulting covariance matrices consist of 18 3x3 block matrices. To prevent
numerical instabilities, N-MDN$_\beta$ is realized as an ensemble of N-MDNs, where
each network models the 3 dimensions of the global translation or a single joint,
respectively. The outputs of the networks in the ensemble are then combined into
the targeted 54-dimensional N-Curve. In this case, all joints are modeled
independently. Each N-MDN in the ensemble is parameterized with $d_{enc} = 128$,
$N_{curves} = 1$ and $N_{cpts} = 4$, using a 1-layer LSTM as sequence encoder.
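
Combining the ensemble outputs for a single control point amounts to
concatenating the 3-dimensional mean vectors and placing the 3x3 covariance
blocks along the diagonal of a 54x54 matrix. The sketch below illustrates this
with plain nested lists; the function names are hypothetical and the actual
combination code is not shown here.

```python
def assemble_block_diagonal(blocks):
    """Place square blocks along the diagonal of one large covariance matrix."""
    size = sum(len(block) for block in blocks)
    full = [[0.0] * size for _ in range(size)]
    offset = 0
    for block in blocks:
        k = len(block)
        for i in range(k):
            for j in range(k):
                full[offset + i][offset + j] = block[i][j]
        offset += k
    return full

def concatenate_means(means):
    """Stack the per-network mean vectors into one pose-sized mean vector."""
    return [value for mean in means for value in mean]
```

With 18 blocks of size 3x3, all entries outside the blocks remain zero, which
reflects the independence between joints in this variant.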


5.2.5 Quantitative Results 


This section provides the quantitative results of the N-MDN variants and the 
comparison models on the test dataset according to the standard protocol. 
The results for the simple and neural baselines are taken from [Mar17]. For 
the comparison models, the results are gathered from their respective papers. 
Thereby, only the overall best performing model variant, if there are several, is
considered. The joint angle errors are reported in Tables 5.16 and 5.17. It
should be noted that the error standard deviation is commonly not reported
in the literature, thus the standard deviation is left out for all models in the
comparison.
Table 5.16: Mean angle error (lower is better) for short-term human motion prediction on the
Human3.6m dataset for the representative actions walking and eating. Commonly
used simple and neural baselines are provided at the top and recent domain-specific
models in the middle.


                 walking                  |  eating
milliseconds     80    160   320   400    |  80    160   320   400

Zero-velocity    0.39  0.68  0.99  1.15   |  0.27  0.48  0.73  0.86
Run. avg. 2      0.48  0.74  1.02  1.17   |  0.32  0.52  0.74  0.87
LSTM-3LR         0.77  1.00  1.29  1.47   |  0.89  1.09  1.35  1.46
ERD              0.93  1.18  1.59  1.78   |  1.27  1.45  1.66  1.80
SRNN             0.81  0.94  1.16  1.30   |  0.97  1.14  1.35  1.46

QuaterNet        0.21  0.34  0.56  0.62   |  0.20  0.35  0.58  0.70
AGED             0.22  0.36  0.55  0.67   |  0.17  0.28  0.51  0.64
BiHMP-GAN        0.33  0.52  0.63  0.67   |  0.20  0.33  0.54  0.70
Traj-GCN         0.18  0.31  0.49  0.56   |  0.16  0.29  0.50  0.62
A-GCN            0.16  0.29  0.46  0.57   |  0.16  0.27  0.49  0.64

N-MDN$_\alpha$   0.25  0.41  0.64  0.7    |  0.22  0.36  0.61  0.75
N-MDN$_\beta$    0.21  0.35  0.60  0.72   |  0.20  0.35  0.59  0.72
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Table 5.17: Mean angle error (lower is better) for short-term human motion prediction on the 
Human3.6m dataset for the representative actions smoking and discussion. Com- 
monly used simple and neural baselines are provided at the top and recent domain- 
specific models in the middle. 


smoking discussion 
milliseconds 80 160 320 400 | 80 160 320 400 
Zero-velocity 0.26 048 0.97 0.95 | 0.31 0.67 0.94 1.04 
Run. avg.2 030 052 0.99 0.97 | 0.41 0.74 099 1.09 
LSTM-3LR 1.34 165 204 216 | 188 212 225 2.23 
ERD 1.66 1.95 235 242 | 227 247 2.68 2.76 
SRNN 145 1.68 1.94 2.08 | 122 149 183 1.93 
“QuaterNet 0.25 047 0.93 0.90 | 0.26 0.60 0.85 0.93 
AGED 027 043 0.82 0.84 | 027 056 0.76 0.83 
BiHMP-GAN 0.26 0.50 0.91 0.86 | 0.33 0.65 0.91 1.00 
Traj-GCN 022 0.41 0.86 0.80 | 0.20 051 0.77 0.85 
A-GCN 0.20 038 0.79 0.82 0.19 045 072 0.81 
_N-MDN, 026 049 0.95 0.93 |030 065 09 104 
N-MDN, 0.26 048 0.91 0.91 | 0.33 0.71 099 1.04 


Looking at the results, the N-MDN variants generally outperform the simple, 
yet strong, baselines in this task. The neural baselines, which are themselves 
more generic models, similar to the N-MDN, are outperformed by a large 
margin. Expectedly, being a generic model, the N-MDN falls a little bit behind 
in comparison with the domain-specific models. 


Comparing both variants of the N-MDN, the results are very similar. Variant 
a performs slightly better on the discussion action, whereas variant b per- 
forms slightly better on the other actions. Differences between the predic- 
tions generated by both variants are further detailed in the qualitative results 
Section 5.2.6. 


In summary, the N-Curve model performs quite well on the given task, despite
being a more generic probabilistic sequence model. As such, the model is
not specifically built to capture the underlying tree-like structure of the data,
nor does it employ a specialized loss function. An additional factor
contributing to less accurate predictions may be the smoothing behavior of
the model, which is examined in more detail in Section 5.2.6. Finally, the
quantitative evaluation shows that the model scales well to
higher-dimensional data.


5.2.6 Qualitative Results 


For the qualitative evaluation, exemplary predictions generated by the
N-MDN variants are examined. Thereby, differences between both variants are
highlighted and some insight into the behavior of the model is provided. To
this end, Figures 5.25 to 5.28 depict exemplary predictions for all four
actions in the test dataset for both model variants.

Figure 5.25: Qualitative comparison of predictions generated by both variants of the N-MDN on
the discussion action. N-MDN_a generates diagonal covariance matrices. N-MDN_b
generates sparse covariance matrices, which model inter-joint correlations. For
illustration purposes, the sample rate is reduced to 12.5 Hz. For each prediction, the
last 2 observed poses are depicted together with a prediction of 320 milliseconds
(4 time steps) into the future. The full ground truth sequence of poses is depicted
in the first row. The left arm and leg are depicted in blue (ground truth) or purple
(prediction), respectively. Regions of interest are highlighted.

Looking at the discussion action depicted in Figure 5.25, a noticeable difference
between both N-MDN variants can be observed in the movement of the left
arm. While N-MDN_a predicts a downward movement, N-MDN_b generates a
more accurate prediction. Apart from that, both variants generate the same
wrong movement for the right arm, indicating that the actual motion deviates
from the average motion in similar cases.



Figure 5.26: Qualitative comparison of predictions generated by both variants of the N-MDN
on the eating action. N-MDN_a generates diagonal covariance matrices. N-MDN_b
generates sparse covariance matrices, which model inter-joint correlations. For
illustration purposes, the sample rate is reduced to 12.5 Hz. For each prediction, the
last 2 observed poses are depicted together with a prediction of 320 milliseconds
(4 time steps) into the future. The full ground truth sequence of poses is depicted
in the first row. The left arm and leg are depicted in blue (ground truth) or purple
(prediction), respectively. Regions of interest are highlighted.

Figure 5.27: Qualitative comparison of predictions generated by both variants of the N-MDN on
the smoking action. N-MDN_a generates diagonal covariance matrices. N-MDN_b
generates sparse covariance matrices, which model inter-joint correlations. For
illustration purposes, the sample rate is reduced to 12.5 Hz. For each prediction, the
last 2 observed poses are depicted together with a prediction of 320 milliseconds
(4 time steps) into the future. The full ground truth sequence of poses is depicted
in the first row. The left arm and leg are depicted in blue (ground truth) or purple
(prediction), respectively. Regions of interest are highlighted.


The actions eating (Figure 5.26) and smoking (Figure 5.27) both show, apart
from a few joints, a static pose throughout the sequence. As such, only subtle
movements of the left arm can be observed. In the predictions generated by
both N-MDN variants, these movements are seemingly averaged out and thus
not captured by the model. This smoothing effect is more visible when looking
at single dimensions of the pose representations, as depicted in Figure 5.31
towards the end of this section.

Figure 5.28: Qualitative comparison of predictions generated by both variants of the N-MDN on
the walking action. N-MDN_a generates diagonal covariance matrices. N-MDN_b
generates sparse covariance matrices, which model inter-joint correlations. For
illustration purposes, the sample rate is reduced to 12.5 Hz. For each prediction, the
last 2 observed poses are depicted together with a prediction of 320 milliseconds
(4 time steps) into the future. The full ground truth sequence of poses is depicted
in the first row. The left arm and leg are depicted in blue (ground truth) or purple
(prediction), respectively.


The most accurate prediction is obtained for the walking action depicted in
Figure 5.28. This is most likely due to this action involving more pronounced
motion of the entire body. Further, the walking action is more periodic than,
for example, the discussion action. As such, it is more predictable and thus
easier to capture with a statistical model. In the given example, the observed
subject slowly turns to the right, which is correctly captured by both N-MDN
variants. Besides that, it can be seen that both variants capture the general
trend in motion, but the predicted motion is not as nuanced and pronounced
as the actual motion. This is, again, most likely due to the smoothing
property of the model.


In order to gain more insight into the predictions generated by the N-MDN
variants, selected dimensions of the pose representation are depicted in the
following. In this case, the mean prediction with corresponding standard
deviation is provided. The standard deviation can be obtained via
marginalization from the covariance matrix at each predicted time step.

Figure 5.29: Visualization of selected pose representation dimensions for the purpose of
illustrating differences and similarities between N-MDN variants. The green curve
depicts N-MDN_a and the blue curve depicts N-MDN_b. For both curves, the σ region
around the curve is indicated by a shaded region. The ground truth is depicted in
red. Time steps are depicted along the x axis and the unit-less value U of the
selected dimension is given on the y axis.
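
The per-dimension standard deviation used for the shaded regions follows from
the fact that marginalizing a Gaussian onto a single dimension keeps only the
corresponding diagonal entry of the covariance matrix. A minimal sketch,
assuming the predicted covariance matrices are available as NumPy arrays (the
helper name marginal_std and the example values are illustrative only):

```python
import numpy as np

def marginal_std(cov):
    """Per-dimension standard deviations of Gaussians with covariance `cov`.

    Accepts a single (d, d) matrix or a stack of shape (T, d, d),
    e.g. one covariance matrix per predicted time step.
    """
    return np.sqrt(np.diagonal(cov, axis1=-2, axis2=-1))

# illustrative covariance matrix, not taken from the experiments
cov = np.array([[4.0, 1.2, 0.0],
                [1.2, 9.0, 0.5],
                [0.0, 0.5, 0.25]])
std = marginal_std(cov)   # [2.0, 3.0, 0.5]
```

The off-diagonal entries do not enter the marginal standard deviation, which
is why both variants can be visualized in the same way.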


As mentioned before, there is a noticeable difference in the predicted motion
of the left arm for the discussion action when comparing both N-MDN variants
(see Figure 5.25). This can be seen looking at the third dimension of the
representation of the left wrist (see Figure 5.29a). While the N-MDN_b variant
(blue) follows the ground truth, the N-MDN_a variant (green) falsely predicts
an almost constant value. With respect to the subtle arm movements in the
smoking action (see Figure 5.27), it can be seen that both model variants
predict almost constant values for the left wrist, whereas the ground truth
slightly deviates from the constant prediction (see Figure 5.29b).

Figure 5.30: Visualization of selected pose representation dimensions for the purpose of
illustrating the capability of the N-MDN capturing general trends in human motion.
The green curve depicts N-MDN_a and the blue curve depicts N-MDN_b. For both
curves, the σ region around the curve is indicated by a shaded region. The ground
truth is depicted in red. Time steps are depicted along the x axis and the unit-less
value U of the selected dimension is given on the y axis.


Although the N-MDN is not well suited for capturing subtle motions in
a sequence of poses, it is well capable of capturing the general motion of an
observed subject. This can be seen in Figure 5.30, which illustrates exemplary
pose representation dimensions taken from the discussion and walking
examples. In both cases, the N-MDN variants generate N-Curves following
the correct trend with respect to the ground truth.


Finally, the innate smoothing feature of the N-Curve model is quite
noticeable looking at the predictions generated by the N-MDN variants. By
generating a compact representation, the N-Curve model generally averages out
small variations in the data and thus primarily captures trends in the data.
The model thereby copes with small variations by varying the variance of the
control points accordingly. This smoothing effect is depicted in Figure 5.31.

Figure 5.31: Visualization of selected pose representation dimensions for the purpose of
illustrating the smoothing property of the N-Curve model. The green curve depicts
N-MDN_a and the blue curve depicts N-MDN_b. For both curves, the σ region
around the curve is indicated by a shaded region. The ground truth is depicted in
red. Time steps are depicted along the x axis and the unit-less value v of the
selected dimension is given on the y axis.
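
The interplay between a smooth mean curve and the control point variances can
be made concrete with a small sketch. It assumes independent Gaussian control
points, so that a curve point is a Bernstein-weighted sum of independent
Gaussians: its mean is the weighted sum of the control point means and its
covariance is the sum of the control point covariances weighted by the squared
basis values. The function names and the example values are illustrative and
not taken from the implementation used in the experiments.

```python
import numpy as np
from math import comb

def bernstein(n, t):
    """Bernstein basis values b_{i,n}(t), i = 0..n, for a scalar t in [0, 1]."""
    i = np.arange(n + 1)
    binom = np.array([comb(n, k) for k in i], dtype=float)
    return binom * t**i * (1.0 - t)**(n - i)

def curve_point(means, covs, t):
    """Mean and covariance of the Gaussian curve point at time t.

    means: (n+1, d) control point means
    covs:  (n+1, d, d) control point covariances (control points independent)
    """
    b = bernstein(len(means) - 1, t)
    mean = b @ means                           # sum_i b_i(t) * mu_i
    cov = np.einsum("i,ijk->jk", b**2, covs)   # sum_i b_i(t)^2 * Sigma_i
    return mean, cov

# four one-dimensional control points with unit variance (illustrative values)
means = np.array([[0.0], [1.0], [1.0], [0.0]])
covs = np.ones((4, 1, 1))
mean_mid, cov_mid = curve_point(means, covs, 0.5)   # mean 0.75, variance 0.3125
```

With only a few control points, the mean curve is a low-degree polynomial in
t and cannot follow fluctuations at a finer scale, so such fluctuations can
only be accounted for through the variances.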


On a final note, the N-MDN_a variant generally generates higher variances
than the N-MDN_b variant. This may be due to N-MDN_a having to cope
with larger variations in the data, as it processes full 54-dimensional pose
representations, whereas networks within the N-MDN_b ensemble only need
to deal with 3-dimensional data.


5.2.7 Summary


In this section, the scalability of the N-Curve model in terms of data
dimensionality was evaluated. For this, the task of human motion prediction,
where sequences of high-dimensional pose representations have to be modeled,
was considered. The results show that the N-Curve model is well capable of
representing higher-dimensional data by increasing the dimensionality of the
stochastic control points accordingly. While the N-Curve model outperforms
common baselines on the task, it falls a little bit behind in comparison with
recent domain-specific models. However, this was expected, as the N-Curve
model is a generic model, while the domain-specific models incorporate
additional information about the data, such as the arrangement of joints by
using graph networks.


Throughout this thesis, an approach for modeling stochastic processes with
bounded index sets, the N-Curve model, based on a probabilistic extension of
Bézier curves (N-Curves) has been presented. Thereby, a stochastic process
is defined by Gaussian mixture distributions, which evolve along a mixture
of N-Curves. By basing the N-Curve model on Bézier curves, a compact
representation of a stochastic process can be achieved. Together with its
proposed implementation based on Mixture Density Networks, the model
provides a fully regression-based approach to probabilistic sequence modeling,
which does not rely on Monte Carlo techniques during inference, thus reaching
the goals set for this thesis. By using parametric curves and optimizing in
function space rather than the d-dimensional space of sequence values, the
proposed model is able to generate smooth continuous predictions in a single
inference step. Thereby, learning a probability distribution over parametric
curves is in line with Gaussian processes, of which the underlying N-Curves
are a special case. Different properties of the model were examined using
several toy examples on synthetically generated data.


The model has been evaluated extensively on the task of human trajectory
prediction, targeting the overall performance of the model in an application
context, which demonstrated the viability and capabilities of the model.
Looking at the evaluation results, the N-Curve model outperforms other generic
probabilistic sequence models on different error measures capturing unimodal
and multi-modal prediction performance. These models are commonly used as a
basis for more sophisticated, domain-specific models. Further, difficulties in
measuring multi-modal prediction performance were discussed. In the scope
of this discussion, a small experiment was conducted, in which the application
of the Wasserstein metric as a performance measure was proposed. In addition
to this broader evaluation, the model’s scalability to higher-dimensional
data has been shown by applying it to a human motion prediction task. While
the N-Curve model outperformed common simple and neural network-based
baselines, being a more generic model, it generated slightly less accurate
predictions in comparison to recent domain-specific models. Beyond the
scalability assessment, difficulties in covariance estimation in higher
dimensions and the smoothing property of the N-Curve model were discussed.


Finally, building on the concept of N-Curves, a conceptual extension of the
model, which is capable of modeling infinite stochastic processes, has been
presented. For this extension, denoted as the meta-time N-Curve model, a
proof of concept on synthetically generated data has been provided, showing
the overall viability of the approach in specific cases.


This final chapter provides an overview of possible directions for future
research building on the findings of this thesis.


7.1 Tackle Practical Limitations 


First of all, the practical limitations of the N-Curve model revealed in this
thesis could be tackled. The most relevant of these are covariance estimation
in higher dimensions and the assumed stochastic independence of N-Curve
control points.


Covariance Estimation: As indicated in Section 5.2, estimating covariance
matrices for high-dimensional data often leads to numerical instabilities
during model training. This is mainly due to the increasing number of
correlations that have to be estimated and the requirement that covariance
matrices be positive definite. As a result, often only diagonal covariance
matrices are employed. A first step towards tackling this problem was taken
by targeting sparse covariance matrices, in which only a few dimensions were
correlated. Beyond that, it might be interesting to investigate more advanced
approaches to covariance estimation (e.g. [Zho11, Che19]) and their
applicability to training Mixture Density Networks.


Stochastic Dependencies: For deriving a closed-form loss function for the
(recurrent) N-MDN implementation, independence of N-Curve control points
was assumed (see Section 4.1). This independence can be sub-optimal when
using the N-Curve model as a generative model (see Section 3.1.3), as Bézier
curves sampled from an N-Curve do not necessarily have a shape similar to the
mean curve. This can be obstructive when N-Curves estimated from some
dataset are to be used to enrich the dataset with additional, synthetically
generated sequences similar to those already present. Following this, it
could be interesting to examine how stochastic dependencies between control
points can be incorporated into the model.
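
The effect of the independence assumption on sampled curves can be illustrated
with the following sketch, which draws every control point independently from
a diagonal Gaussian and evaluates the resulting Bézier curves. The control
point means, the standard deviation of 0.5 and the helper name bezier are
illustrative assumptions only.

```python
import numpy as np
from math import comb

def bezier(control_points, ts):
    """Evaluate a Bézier curve with control points (n+1, d) at times ts (T,)."""
    n = len(control_points) - 1
    i = np.arange(n + 1)
    binom = np.array([comb(n, k) for k in i], dtype=float)
    basis = binom * ts[:, None]**i * (1.0 - ts[:, None])**(n - i)   # (T, n+1)
    return basis @ control_points

rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 2.0], [3.0, 0.0]])
stds = 0.5 * np.ones_like(means)
ts = np.linspace(0.0, 1.0, 20)

mean_curve = bezier(means, ts)
# every control point is perturbed independently of all other control points
samples = [bezier(means + stds * rng.standard_normal(means.shape), ts)
           for _ in range(5)]
```

Since neighboring control points may be displaced in opposite directions, the
sampled curves can bend differently than the mean curve; a joint distribution
over control points would be one way to restrict such samples.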


7.2 Model Extensions 


Apart from these practical challenges, several model extensions could be ap- 
proached, targeting different parts of the presented model. 


Interpolation of Arbitrary Distributions: In the current formulation, the N- 
Curve model interpolates Gaussian control points in order to obtain a se- 
quence of Gaussian curve points. Following this, the question arises if it 
would be possible to interpolate control points following arbitrary probability 
distributions, in order to obtain a probabilistic curve with curve points then 
following a combined arbitrary probability distribution. To achieve this, the 
operation of combining multiple control points would need to be extended to 
a more abstract or general concept, which allows the transformation of given 
probability distributions. Moving towards this goal, a possibly relevant
approach might be given by Normalizing Flows (see Section 2.2.3), which can be
used to transform simple probability distributions into more complex
distributions by applying a chain of invertible mappings.


The Meta-time N-Curve model: In the scope of this thesis, the meta-time
N-Curve model (see Section 3.2) was introduced as a conceptual extension to
the N-Curve model, lifting some less application-relevant limitations of the
original model. However, the toy examples provided in Section 4.2 suggest
the viability of the meta-time N-Curve model, especially for modeling long
sequences or specifically structured sequential data. Following this, it would
be interesting to further explore the capabilities of this model. Looking at
the timeline mapping functions introduced in the model definition, it could
be especially interesting to examine the possibilities granted by employing
learned mapping functions.


Alternative Formulation for Handling Multi-Modality: Currently, the
N-Curve model uses a mixture distribution approach for modeling multi-modal
stochastic processes. A downside of such an approach is given by the
potential blurring of modes when estimating the mixture parameters, as the
loss is calculated in terms of a linear combination of all K mixture
components. Although mode collapse is mitigated by using Bézier curves as a
basis, less well-defined modes can still be a result of using a mixture
distribution approach. Thus, in order to achieve more clearly separated modes,
more emphasis could be put on the component selection by introducing a
notion of attention [Vas17, Dai19] into the model. In this case, an attention
mechanism could be used to decide which of the K available N-Curves to
select or combine for a given input.
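
The way all K components enter the loss jointly can be seen in the generic
negative log-likelihood of a Gaussian mixture. The sketch below shows this
standard form for a single data point with diagonal components; it is not the
exact loss of the N-MDN, which is defined over curve points of N-Curves, and
the function names are illustrative.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var):
    """Log density of a diagonal Gaussian; x, mean, var have shape (..., d)."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean)**2 / var, axis=-1)

def mixture_nll(x, weights, means, variances):
    """Negative log-likelihood of x under a K-component Gaussian mixture.

    x: (d,), weights: (K,), means and variances: (K, d).
    """
    log_terms = np.log(weights) + gaussian_log_pdf(x, means, variances)   # (K,)
    m = np.max(log_terms)
    # numerically stable log of the weighted sum over all K components
    return -(m + np.log(np.sum(np.exp(log_terms - m))))

x = np.array([0.0])
weights = np.array([0.5, 0.5])
means = np.array([[0.0], [0.0]])
variances = np.array([[1.0], [1.0]])
nll = mixture_nll(x, weights, means, variances)
```

Because the likelihood is a weighted sum over all components, every component
receives gradient for every data point, in proportion to how much it
contributes to that sum. This is one way to see how neighboring modes can
blur into each other.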


N-Curve Gaussian Processes: Finally, it could be interesting to elaborate more
on the properties and potential advantages and disadvantages of the class of
Gaussian process kernels induced by an N-Curve in comparison with other
kernels. Additionally, it could be examined if and to what extent the
N-Curve model and its implementation would benefit from the incorporation of
concepts taken from Gaussian processes.
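
As a starting point for such a comparison, the covariance function induced by
an N-Curve can be written down directly. The sketch assumes scalar,
independent Gaussian control points with variances sigma_i^2, in which case
the covariance between the curve points at times t and s is
k(t, s) = sum_i b_i(t) b_i(s) sigma_i^2, with b_i denoting the Bernstein basis
polynomials. The function names and the chosen variances are illustrative.

```python
import numpy as np
from math import comb

def bernstein_matrix(n, ts):
    """Matrix of Bernstein basis values with shape (T, n+1) for times ts (T,)."""
    i = np.arange(n + 1)
    binom = np.array([comb(n, k) for k in i], dtype=float)
    return binom * ts[:, None]**i * (1.0 - ts[:, None])**(n - i)

def ncurve_kernel(variances, ts):
    """Gram matrix k(t, s) = sum_i b_i(t) b_i(s) sigma_i^2 on the grid ts."""
    B = bernstein_matrix(len(variances) - 1, ts)
    return (B * variances) @ B.T

ts = np.linspace(0.0, 1.0, 6)
K = ncurve_kernel(np.array([1.0, 0.5, 0.5, 2.0]), ts)   # 6x6 Gram matrix
```

The resulting Gram matrix is symmetric and positive semi-definite, and its
rank is bounded by the number of control points, which distinguishes this
kernel from commonly used kernels such as the squared exponential kernel.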